Transcriptions

Note: this content has been automatically generated.
00:00:00
Thank you for the kind introduction. Is my microphone on? OK. I'm really excited to be here today, and my talk will be about recent advances in weakly-supervised learning and robustness. Before starting, let me briefly introduce myself. I'm the director of the RIKEN Center for Advanced Intelligence Project, a national AI research project in Japan that started in 2016. We work on the fundamentals of AI, goal-oriented AI, and AI in society; it is actually quite a big center, with more than three hundred people. At the same time I'm a professor at the University of Tokyo, teaching computer science and supervising students — my lab has grown quite big and I now have about fifty students. I also do some part-time consulting for several local start-ups, for example on real estate and advertisement, and I enjoy doing these kinds of things.
00:01:04
OK, then, about my talk today. Machine learning from big labeled data is really successful, for example in speech recognition, image understanding, machine translation, recommendation, and so on. But there are still various applications and domains where massive labeled data is simply not available. For example, we are interested in medical research, disaster resilience, robotics, brain-signal analysis, and similar areas. In those applications a human is always involved in the data-collection process, and we cannot obtain a huge number of data points. The target of my talk is that kind of setting: we only have limited information, but we still want to learn. My main point is that this is not about classification from small data — learning from small data is generally not possible statistically, unless we have strong prior knowledge. Instead, we consider weak data, but we assume that we have many of them. So n, the number of data points, should still be large; the question is what those data points are — unlabeled data, or maybe some other kind of weak data. That is the main point today.
00:02:19
The target problem is the simplest one: binary supervised classification. We have blue points and red points, and we just want to separate them. We know that a large amount of labeled data gives better classification accuracy: the misclassification error of the learned classifier decreases in order one over the square root of n, where n is the number of labeled data points. That is the standard result.
00:02:44
But, as I said, gathering labeled data is sometimes costly or even impossible. The complete opposite is unsupervised learning: learning from unlabeled data only. We only have points like this, not labeled yet, and hopefully we can collect such unlabeled data quite cheaply. Typically, unsupervised classification means clustering: we separate the points into one cluster for one class and another cluster for the other class. Unsupervised clustering can work as well as supervised classification if one cluster corresponds to one class — say the top-left cluster is all blue and the bottom-right cluster is all red — in which case clustering performs comparably to a supervised method. But in reality it is usually not that simple, and the clustering result may not be really useful; this was just for illustration purposes.
00:03:39
People have studied semi-supervised classification for many years: the idea is to utilize a large number of unlabeled data points in addition to a small number of labeled samples. But the way the problem is solved is still somewhat similar to unsupervised classification: basically some kind of clustering is done, either implicitly or explicitly, except that now we have additional labeled points, so we can propagate this blue label over this cluster and this red label over that cluster. Again we have the same problem: if one cluster corresponds to one class it works well, but otherwise it doesn't. This kind of method works really well on datasets like MNIST, but on harder datasets it often doesn't work so well.
00:04:24
The summary so far is like this: I put classification accuracy on the horizontal axis and labeling cost on the vertical axis. Supervised learning is the most accurate, but its labeling cost is quite high, so it sits in the top-right corner. Unsupervised classification has minimal labeling cost, but its accuracy is usually low, so it sits in the bottom-left corner. Semi-supervised classification is located somewhere in between; we would like it to be much better, but in reality it doesn't work so well. So there is an open space here, and that is our target: classifiers with high accuracy and low labeling cost. That is the goal today, and we have been working on this topic for the last five years or so; hopefully I can give you some ideas today.
00:05:13
This is the list of methods I want to introduce today. We have a lot of acronyms, P, N, U and so on: P is positive, N is negative, U is unlabeled, "conf" is confidence, "S" is similar, and "comp" is complementary. We have many kinds of weak data, and we want to learn from such weak data. If you have any questions, please just interrupt me and ask.
00:05:39
Let me start with PU classification: learning from positive and unlabeled data. The problem is the same — I want to separate blue points and red points — but in this scenario we only have blue points (positive data) and black points (unlabeled data); we don't have a single negative point. Still, the goal is to separate positive from negative. This kind of setting appears, for example, in click prediction: we want to predict whether a user will click a given link or not. Positive data can be collected easily by looking at the history of clicks — those are all positive. But non-clicked links are quite difficult to handle: we might want to regard them as negative data, but that is not necessarily true, because sometimes the user likes those links and simply didn't have time to click them. So non-clicked links are really unlabeled data: they can be either positive or negative. In that kind of scenario we naturally have only positive and unlabeled points, but we still want to separate positive from negative — that is the target problem here.
00:06:54
To solve this problem, we define the risk of a classifier in the standard way. We have a loss function ℓ — it can be the hinge loss, the squared loss, or whatever — measuring the point-wise loss, and this loss is expected over data points drawn from p(x, y). So far this is completely standard; no question here. We typically decompose this risk into a positive part and a negative part. In the standard supervised setup we have both positive and negative data, so the risk for positive data is estimated from positive samples and the risk for negative data is estimated from negative samples — that is ordinary empirical risk minimization. But in the PU setup we have no negative data, so we cannot directly estimate the second term; that is the challenge we need to overcome. One thing I have to explain here is π: it is the class prior, the probability of class +1. It is unknown in reality and has to be estimated from data, but there are already several nice methods for this that work only from positive and unlabeled data. To keep the story simple today, I assume that π is known; in reality, given PU data, we would first estimate π and plug the estimate in. So now let us deal with the second term, the risk for negative data.
00:08:30
We have a very simple trick to estimate that second term, because we know that unlabeled data is a mixture of positive and negative data. Suppose the unlabeled data density is this black curve, the positive class-conditional density is this one and the negative one is this other one, mixed with class proportions π and 1 − π. So we have this kind of mixture equation. Now, the unlabeled data are samples from the marginal density and the positive data are samples from the positive class-conditional density. Then it is straightforward to see that the negative term can be estimated by moving the first term to the left-hand side. That is the only trick we use here.
00:09:16
More specifically, given the risk, we decompose it into the positive part and the negative part. The expectation over negative data cannot be handled directly, but based on the mixture equation from the previous slide we simply replace that part with these two terms. Then we obtain an expression of the risk in which the first term is the positive part, and the second and third terms together correspond to the risk for negative data. If you look at this expression, we only have expectations over positive points (here and here) and an expectation over unlabeled points. So we can simply apply empirical risk minimization using only positive and unlabeled data. That is the story of PU learning so far; basically we are done — we have solved the problem.
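To make the rewriting just described concrete, here is the derivation written out in one common notation (a sketch; π denotes the class prior p(y = +1), ℓ a non-negative loss, and p_P, p_N, p_U the positive, negative, and marginal input densities):

```latex
% Classification risk, split into positive and negative parts
R(f) = \pi\,\mathbb{E}_{p_{\mathrm P}}\bigl[\ell(f(x))\bigr]
     + (1-\pi)\,\mathbb{E}_{p_{\mathrm N}}\bigl[\ell(-f(x))\bigr]

% Unlabeled data is a mixture of the two classes
p_{\mathrm U}(x) = \pi\,p_{\mathrm P}(x) + (1-\pi)\,p_{\mathrm N}(x)
\;\Longrightarrow\;
(1-\pi)\,\mathbb{E}_{p_{\mathrm N}}\bigl[\ell(-f(x))\bigr]
  = \mathbb{E}_{p_{\mathrm U}}\bigl[\ell(-f(x))\bigr]
  - \pi\,\mathbb{E}_{p_{\mathrm P}}\bigl[\ell(-f(x))\bigr]

% Substituting gives a risk involving only P and U expectations
R(f) = \pi\,\mathbb{E}_{p_{\mathrm P}}\bigl[\ell(f(x))\bigr]
     - \pi\,\mathbb{E}_{p_{\mathrm P}}\bigl[\ell(-f(x))\bigr]
     + \mathbb{E}_{p_{\mathrm U}}\bigl[\ell(-f(x))\bigr]
```

Each expectation can then be replaced by the corresponding sample average over the positive or unlabeled set, which is the empirical risk minimized in practice.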
00:10:08
(Question from the audience about the assumptions on the unlabeled data and the model.) Yes, it can be anything — we just assume that the unlabeled points are drawn i.i.d. from the marginal distribution. We don't need anything more than that: we just need samples, and each expectation is replaced by a sample average, without assuming any parametric form behind it; the estimator is just the sample mean. So, to be precise, I do assume that the samples are i.i.d. from the corresponding distributions, I agree, but I put no assumption on the model f — I will say more about the model a bit later, but it can be anything: a linear model, a CNN, an LSTM, whatever model you like.
00:11:04
then suppose model is i mean yeah funk some minimal though then we can have a simple like estimation error problems like it's
00:11:11
so i don't really going to the detail but the important point is we have one low bass scared scared of and be here
00:11:18
and one of oscar the and you hear so and b. and and and you are the number of data points from the and you
00:11:24
so then so we know that one the most care route that scare them and is this really the best although
00:11:30
we cannot see so then this isn't isn't it opens
00:11:32
already means that this simple and bigger risk madison approaches optimal
00:11:39
and once we perform the same analysis for the in learning
00:11:43
standards about was lining then we have a similar problem like this
00:11:47
and again we have one of moscow them and be here and one of us go with them and and here again it is optimal
00:11:54
pencil p. you're running and also p. enlightening achieve the of michael wasn't straight from this analysis
00:12:00
but now do we can compare these two bonds because i do we have the common concerns here
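Schematically, the two bounds being compared have the following shape (only the shape the speaker describes; constants and the dependence on the class prior are omitted here):

```latex
% PU learning: estimation error controlled by the P- and U-sample sizes
R(\hat f_{\mathrm{PU}}) - R(f^{*}) \;\le\; C\left(\frac{1}{\sqrt{n_{\mathrm P}}} + \frac{1}{\sqrt{n_{\mathrm U}}}\right)

% PN learning: estimation error controlled by the P- and N-sample sizes
R(\hat f_{\mathrm{PN}}) - R(f^{*}) \;\le\; C\left(\frac{1}{\sqrt{n_{\mathrm P}}} + \frac{1}{\sqrt{n_{\mathrm N}}}\right)
```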
00:12:06
then so this idea quite interesting because we started this spiel on results other kind of compromise
00:12:12
because we we can't collect native data for is that that is unable data so that was a compromise
00:12:18
but by comparing these two buttons do we can so that under some condition
00:12:23
you running can be better than p. and on this uh this p. line can be better than started standards of about one
00:12:30
the uh the condition is really like this i just compare these two pounds
00:12:35
if you have a last number because the points then we've done side tends to be small and
00:12:40
if you have a large number and a bit are also the fun side than to be small
00:12:44
then if we have only a small number of the of the points then right on site and to become a lot
00:12:50
so that means if you only have a small number of native data then is that the relying on these little
00:12:56
small number one maybe a a small number of nudity points using allows normal and they will point is actually better
00:13:02
so this is the kinds that suited for finding and i i think it's quite
00:13:05
interesting and contacted and i will do so later about this really happens in reality
00:13:12
OK, so we have kind of solved the PU learning problem. Unfortunately, this naive method does not really work well, and I would like to explain why. What we have done so far is to decompose the risk into a positive part and a negative part, and the negative part was further decomposed into these two terms using the mixture equation. The problem is this minus term. Suppose the loss function is non-negative; then by definition the risk for positive data is non-negative and the risk for negative data is also non-negative. But the second term is now decomposed into two terms with a minus sign between them. By definition their combination is still non-negative — something like six minus two. In reality, however, we estimate each term by a sample average: the six is estimated from data and the two is also estimated from data, and sometimes it comes out as, say, two minus four, which is negative. This is quite problematic, and we found that this negativity problem really does happen when we use a flexible model such as a deep neural network. Let's see a numerical example.
00:14:30
We used a standard setup on MNIST and trained with back-propagation. If we use positive and negative data in the standard way, the training error drops like this dotted line, stays non-negative, and converges nicely; the test error also decreases nicely and converges to a reasonable value. But with PU learning the training error drops like this, at some point it becomes negative, and then it really goes down. Interestingly, if you look at the test error, it starts to increase. This is understandable: the training error should be non-negative, but because of that negative term it takes a large negative value, and this is a clear indication of overfitting. We wanted to avoid this issue, so we decided to constrain the sample approximation of that term to be non-negative: this part is non-negative in expectation, so we just take the maximum with zero — whenever the sum of those two terms becomes negative, we simply round it up to zero.
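As an illustration of the correction just described, here is a minimal sketch of the non-negative PU risk for one minibatch in PyTorch. This is my own paraphrase, not the authors' reference implementation; `sigmoid_loss` is just one possible non-negative surrogate loss, and `prior` is the class prior π, assumed known or pre-estimated.

```python
import torch

def sigmoid_loss(margin):
    # A smooth, non-negative surrogate: l(z) = sigmoid(-z).
    return torch.sigmoid(-margin)

def nn_pu_risk(scores_p, scores_u, prior, loss_fn=sigmoid_loss):
    """Non-negative PU risk for one minibatch (a sketch).

    scores_p: classifier outputs f(x) on positive examples
    scores_u: classifier outputs f(x) on unlabeled examples
    prior:    class prior pi = p(y = +1)
    """
    # Positive part: pi * E_P[ l(f(x)) ]
    risk_p = prior * loss_fn(scores_p).mean()
    # Estimated negative part: E_U[ l(-f(x)) ] - pi * E_P[ l(-f(x)) ]
    risk_n = loss_fn(-scores_u).mean() - prior * loss_fn(-scores_p).mean()
    # The correction: round the (possibly negative) estimate up to zero.
    return risk_p + torch.clamp(risk_n, min=0.0)
```

Without the final `clamp`, this is the unbiased estimator from the earlier slides; with a flexible model, `risk_n` is exactly the quantity that can be driven below zero.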
00:15:40
(Question from the audience about which terms are constrained.) Just to be clear: the first term, the risk for positive data, is kept as it is. The second and third terms together form the risk for negative data — maybe I didn't define it carefully, but those two terms together are what gets replaced by sample averages; this is just a shorthand notation. When that estimated negative-class term becomes negative, we round it up to zero. (Do you do this per minibatch?) Yes, we do this for each minibatch.
00:16:45
Originally we justified our estimator by the fact that it is unbiased, and we had the bound I showed; but because of this max-with-zero trick it becomes biased, so we need to re-justify the estimator somehow. I won't go into the details, but we have some nice theoretical results: the corrected estimator is consistent, its bias decreases exponentially fast, its mean squared error is smaller, and for nonlinear models it still achieves the optimal convergence rate. In that sense we can justify the use of this max-with-zero trick.
00:17:22
Let's see some experimental results — the correction really improves things dramatically. Here we use a binarized version of the CIFAR-10 dataset. For PN learning, shown in blue, the training error decreases and converges to zero nicely, and the test error also drops and converges. For PU learning without any modification, the training error again goes negative like this, and then the test error increases, so it doesn't work. But with the max-with-zero trick the training error decreases and stays non-negative, the test error also keeps decreasing, and in the end this value is actually smaller than the test error of PN learning. The setting is a bit artificial, because we chose the numbers of data points so that the inequality from the earlier analysis is satisfied: we have only a small number of negative data and a large number of unlabeled data. But clearly, in that situation, PU learning can outperform PN learning — standard supervised learning — and that is an interesting finding.
00:18:38
OK, here is a summary of PU learning. It may look a bit complicated, but in the end what we are doing is quite simple: we are basically separating the blue points from the black points. The black unlabeled points are a mixture of red and blue points, so we regard the unlabeled data as noisy negative data and separate blue from black. Doing this naively is biased — the resulting boundary sits in the wrong place, so it doesn't work. To remove the bias we change the loss function: as I showed, the risk expression can be viewed as using a composite loss obtained by combining two losses. Essentially we define a modified loss and use it for the positive data (the blue points), while the unlabeled data just gets the ordinary loss for the negative class. With these modified loss functions we can systematically eliminate the bias caused by treating unlabeled data as negative, and we can establish the optimal convergence rate. If the loss function satisfies a certain symmetry condition — satisfied, for example, by the ramp loss — then this essentially reduces to using the same loss for the positive and unlabeled data. If the composite loss is a linear function, the optimization problem stays convex; this is satisfied by the squared loss, the logistic loss, and the double-hinge loss, so we often use the double hinge. These loss functions look like the curves shown here. Finally, for deep networks, we have the non-negative correction: rounding the empirical risk for the negative class up to zero really improves the performance. That was the summary of PU classification — do you have any questions?
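In symbols, the composite-loss view of PU learning mentioned in this summary looks roughly as follows (a sketch; scaling conventions may differ from the papers):

```latex
% Grouping the PU risk by data source: the P data gets a composite loss
R_{\mathrm{PU}}(f)
  = \pi\,\mathbb{E}_{p_{\mathrm P}}\bigl[\,\ell(f(x)) - \ell(-f(x))\,\bigr]
  + \mathbb{E}_{p_{\mathrm U}}\bigl[\ell(-f(x))\bigr]

% If l(z) - l(-z) is linear in z (e.g. equal to -z), the objective stays convex
% for a convex l and a linear-in-parameters model; this holds for the squared,
% logistic and double-hinge losses.  The ramp loss instead satisfies the
% symmetry l(z) + l(-z) = 1.
```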
00:20:38
(Question from the audience, not fully audible.) In the standard supervised case we just estimate this expectation by a sample average and that expectation by a sample average, and then non-negativity is preserved automatically, so we simply apply standard empirical risk minimization. But in our case the second term is decomposed into something like a minus b, and because of that minus sign strange things can happen. (Follow-up question.) Yes — in expectation the sum of those two terms is still non-negative, and with infinitely many data points it would stay non-negative, but because of the finite-sample effect it can become negative; and if you use a flexible model it almost always does become negative, because essentially the model tries to push that quantity as low as possible. That is the problem, and the max-with-zero trick prevents that effect.
00:22:11
So that was PU learning: we can now solve the classification problem only from positive and unlabeled data. In the second part on weakly-supervised learning we have a number of different methods, which are all variations of PU learning. Let's start with PNU learning; this one is simple. PNU learning is basically semi-supervised classification: we have positive, negative, and unlabeled data — blue points, red points, and black points. Our idea is to decompose this problem into PU, PN, and NU learning. We have already solved the PU learning problem, and NU learning is essentially the same as PU learning, so we can use the same method; PN learning is just standard supervised learning, so we can use whatever we like. We can solve each of them separately, so why don't we simply add them up? We have learning criteria for PU, PN, and NU learning, and we just combine them. The next small question is how to combine these three criteria. We could introduce two hyperparameters and tune them by cross-validation — that would be fine — but we wanted to keep things as simple as possible, so we decided to choose two out of the three.
00:23:34
Naively, I first thought we should combine PU learning and NU learning, because they are symmetric. But we found, based on the same kind of risk analysis, that combining PU and NU is actually not the best choice. I won't go into the details, but we can show that if PU learning is better than NU learning we have one ordering, and otherwise the other; and if we want to choose the best two out of three, then the PU+NU pair is not the best — combining PU with PN, or NU with PN, is the best combination, at least according to the risk analysis, and we also confirmed this quite clearly in experiments.
00:24:17
OK, this may sound a bit complicated, but in the end what we do is quite straightforward. We have PU learning, PN learning, and NU learning, and a single hyperparameter η. If η is minus one, we just use PU learning; as η increases, PN learning gains more influence; if η is zero we use only PN learning — we don't use the unlabeled data at all; and as η increases further towards one, it moves to NU learning. So we simply control the balance with this single hyperparameter η, which is chosen systematically by cross-validation.
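One way to write the single-hyperparameter combination just described (my paraphrase of the scheme; the exact parameterization in the paper may differ):

```latex
R_{\eta}(f) =
\begin{cases}
(1+\eta)\,R_{\mathrm{PN}}(f) \;+\; (-\eta)\,R_{\mathrm{PU}}(f), & -1 \le \eta \le 0,\\[4pt]
(1-\eta)\,R_{\mathrm{PN}}(f) \;+\; \eta\,R_{\mathrm{NU}}(f), & \phantom{-}0 \le \eta \le 1,
\end{cases}
```

so that η = −1 recovers pure PU learning, η = 0 pure PN learning, and η = 1 pure NU learning, with η selected by cross-validation.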
00:24:58
That is the final method. Then — and I find this quite interesting — we can upper-bound the generalization error in this way. The important part is this last term, which is of order one over the square root of n_U, where n_U is the number of unlabeled data points. This shows that the generalization error decreases as we increase the number of unlabeled data points. In this derivation we made no assumption on clusters or anything like that; we only assume that the data points are drawn i.i.d. from the underlying distributions — the standard setup. Still, we can guarantee that the generalization error decreases as we get more unlabeled data. To the best of our knowledge, this is the first result that obtains this kind of error bound without an additional assumption such as the cluster assumption.
00:25:51
More technically, the way we use the unlabeled data is quite different from previous semi-supervised learning methods. Previous methods basically use unlabeled data for regularization — for label propagation, say, where labels are propagated over the clusters — and that is essentially a way of reducing the variance of the solution. In our case the method is based on PU and NU learning: in PU learning the unlabeled data is regarded as noisy negative data, and in NU learning it is regarded as noisy positive data, and the unlabeled data is used for evaluating the loss itself, which means it is used for reducing the bias. So the role of the unlabeled data is quite different: we are essentially extracting label information from the unlabeled data, and that is why we can achieve the optimal convergence rate even without a cluster assumption. Then we did some experiments.
00:26:58
Of course, it is not that easy to claim that a new method works really well, but at least, because of this bound, we can guarantee that it is safe. If you use a standard semi-supervised classification method and the cluster assumption is not really satisfied, the performance can be even worse than plain supervised learning from the small number of labeled points. That did not happen in our experiments. Compared with well-known previous methods — which sometimes work quite well but whose performance is rather unstable — our method is quite stable. I cannot simply say that ours is always the best, but at least it is safe and stable; that is the take-away here. OK, that was PNU classification. Then let's move on to the next topic... hmm, did something go wrong? Very strange... I think the slides are frozen.
00:28:37
OK, the next one is Pconf classification. In PU learning we assumed that unlabeled data is available, but sometimes even unlabeled data is not available. For example, if we work for some company and try to analyze its customer data, we are typically not allowed to access the databases of other companies because of privacy issues. In that case we can only access the database of our own company: only positive data, collected from a single class. Learning from positive data alone is, in general, not possible — it is basically the same as one-class classification, an unsupervised problem — and we cannot really train a classifier systematically. But we found that if the positive data is equipped with confidence — for each positive sample we are given how likely it is to be positive, its class-posterior probability — then we can actually solve the problem. This is called Pconf classification: learning from positive-confidence data.
00:29:37
The derivation follows the same spirit as before: we first define the classification risk as usual. Naively, we might think that if a point is 80% positive then it is 20% negative, so we could treat it as 80% of a positive example and 20% of a negative example — the standard way of weighting data points. But we can show that this is not the right thing to do: it does not agree with the true risk, because our data comes only from the positive class. The correct expression looks somewhat similar, but it is like this, involving the ratio of one minus the confidence to the confidence; we only need a few lines to derive it. Then we can again use the empirical risk minimization approach to obtain a training criterion. So the problem can be solved, and one nice thing is that the class prior π appears only as a proportional constant, so we can simply drop it — we do not have to estimate π in this scenario, which keeps things simple. We can again show a one-over-root-n convergence rate, so the method is optimal in that sense. We also did some experiments; they are a bit difficult to design because there is no clear baseline to compare with. If you compare with PN learning, then of course PN learning is the best — it works like this — but if you compare our method with the naive weighting method that I just argued is wrong, then our method is better.
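For reference, the two objectives being contrasted here have roughly the following shape (a sketch based on the description above; r(x) = p(y = +1 | x) is the given confidence and π the class prior, which enters only as a constant factor):

```latex
% Naive confidence weighting (does not agree with the true risk):
R_{\mathrm{naive}}(f) = \mathbb{E}_{p_{\mathrm P}}\bigl[\,r(x)\,\ell(f(x)) + (1-r(x))\,\ell(-f(x))\,\bigr]

% Pconf objective: an expectation over positive data only, with an odds-ratio weight
R(f) \;\propto\; \mathbb{E}_{p_{\mathrm P}}\Bigl[\,\ell(f(x)) + \tfrac{1-r(x)}{r(x)}\,\ell(-f(x))\,\Bigr]
```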
00:31:25
(Question from the audience about the assumptions on the confidence.) Right — we assume that the confidence r(x) is non-zero, which means that the support of the positive class covers the entire domain, because what we are doing is essentially an importance-weighting-style correction. (Follow-up about very small confidence values.) Right — in the experiments we also encountered that issue: when the given confidence is very low it causes some numerical instability, so we made a small modification and clipped very small confidence values to a slightly larger constant, and then the behavior was stable. In practice, how to define the confidence is perhaps another issue. For example, with crowdsourcing we can ask five crowdworkers to label the same sample; if one out of five says yes, or four out of five say yes, we can take 20% or 80% as the confidence, or something like that. That is one way, but the question is still open; depending on the use case we need to define the confidence in the right way, and that has not really been explored yet. OK, that was Pconf classification.
00:33:06
Next, let's go even further with weak supervision: UU classification, learning from two sets of unlabeled data. Consider a completely unsupervised scenario, but suppose we have two unlabeled datasets, this one and this one. Our assumption is that each dataset contains both positive and negative examples mixed together, but the class priors — the class proportions — are different: for the left dataset maybe 50% positive and 50% negative, and for the other one maybe 70% positive and 30% negative. We assume the two datasets have different class priors, but we do not even have to know the priors; we only assume that they differ. This assumption is necessary: if you take a single unlabeled dataset and just split it into two subsets, they have the same class prior, and then this does not work. Then we can show that from these two unlabeled datasets alone we can learn the decision boundary. We cannot know which side is positive and which is negative — we have no labeled data at all — but we can still draw the boundary. The trick is again related to PU learning: as I said at the beginning, in PU learning we regard the unlabeled data as noisy negative data. In UU learning both sets are unlabeled, and we regard one as noisy positive data and the other as noisy negative data and simply separate the two sets. Then we can correct the bias completely and obtain an unbiased solution, and again we can achieve the one-over-root-n convergence rate, only from two sets of unlabeled data. This may sound a bit sensational: it is completely unsupervised, yet we can train a classifier without using a single label. I am thinking of using this kind of technique, for example, for collecting data from two different hospitals. Suppose we want to predict cancer or not: we do not really need any labels — we just collect patient information from the two hospitals and assume that the proportions of cancer patients differ between them; then, without a single label, we can in principle train the classifier. So those are the results here.
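A quick sketch of why two unlabeled sets with different class priors carry label information: if the two priors θ and θ′ were known (the method described above only needs them to be different, not known), the class-conditional densities could in principle be recovered from the two mixtures by solving a 2×2 linear system:

```latex
p_{1}(x) = \theta\,p_{\mathrm P}(x) + (1-\theta)\,p_{\mathrm N}(x), \qquad
p_{2}(x) = \theta'\,p_{\mathrm P}(x) + (1-\theta')\,p_{\mathrm N}(x), \qquad \theta \neq \theta'

p_{\mathrm P}(x) = \frac{(1-\theta')\,p_{1}(x) - (1-\theta)\,p_{2}(x)}{\theta - \theta'}, \qquad
p_{\mathrm N}(x) = \frac{\theta'\,p_{1}(x) - \theta\,p_{2}(x)}{\theta' - \theta}
```

In practice the method works at the level of risks rather than densities: one set is treated as noisy positive data and the other as noisy negative data, and the resulting bias is corrected, in the same spirit as PU learning.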
00:35:55
(Question from the audience about whether this applies to image data.) I am not directly working on image data myself here, but as long as the data can be regarded as i.i.d. samples from the two distributions, it should work — I think it would be very nice to try.
00:36:21
Here is a small variation, which we call SU classification: learning from similar and unlabeled data. Consider a sensitive classification problem — predicting income level, religion, political opinion, things like that. Sometimes it is not easy for people to answer directly "my opinion is yes" or "no"; collecting explicit labels is quite difficult. But sometimes it is easy to say "I have the same opinion as that person", or "these two people have the same opinion". So we do not have to reveal yes or no; we only collect pairs with the same opinion: these two are similar, those two are similar. This looks like constrained clustering with must-link constraints, but constrained clustering is unsupervised learning, whereas this is actually supervised classification: we can show that, from pairwise similar data and unlabeled data points alone, we can draw a decision boundary. Again, since we have no labeled data, we cannot know which side is positive and which is negative, but at least we can draw the line, and again with the optimal one-over-root-n rate. The trick is quite simple: pairs of similar points can essentially be decomposed into unlabeled data, so we effectively end up with two sets of unlabeled data and can apply the same technique as before. That is how we solve the problem. We can also extend this to use dissimilarity information — "these two people have different opinions" — and again, from dissimilar pairs and unlabeled data alone we can train a classifier. So it is completely unsupervised in terms of explicit labels, but we can still solve the problem. The next topic is complementary-label classification.
00:38:10
This one is slightly different from the previous stories: it is not a binary problem but a multi-class problem with more than two classes. For example, "what is the robot in this image?", and we have a lot of candidate classes, say one hundred. If we use crowdsourcing to label this kind of data, the crowdworker has to go through the labels from 1 to 100: "OK, this is number 83, the Boston Dynamics robot", or something like that. This labeling process is really time-consuming, because selecting the correct class from a long list of candidates is quite painful. So we decided to use what we call complementary labels — a kind of "wrong" label. We collect pairs (x, ȳ), where ȳ is a class that the sample does not belong to: this image is not a cat, this one is not a dog, this one is not class one, and so on. It is quite easy to collect that kind of complementary information, because if we randomly pick one of the classes it is almost always wrong: with one hundred classes in a single-label (not multi-label) problem, only one class is correct and ninety-nine are wrong, so a randomly picked class is almost always wrong. The crowdworker then just has to say "wrong, wrong, wrong" and we can collect the data easily. And we can show that, from complementarily labeled data alone, we can train a classifier. The assumption is that the complementary labels are uniformly distributed over the wrong classes: there are ninety-nine wrong classes and we assume a uniform distribution over those ninety-nine. A biased distribution can also be handled if you have some prior knowledge, but for simplicity we assume a uniform distribution here.
00:40:10
For this complementary-label problem we can think of two baseline approaches. One is a method called classification from partial labels — actually a quite nice method — where multiple candidate classes are provided for each sample: "this sample belongs to either class one or class two", one of which is correct and the others wrong. Complementary classification can be regarded as an extreme case of this partial-label scenario, where all of the classes other than ȳ are given as candidates. The second approach is not really correct, but we could just use a multi-label method: each sample is allowed to belong to multiple classes, and we give a negative label for ȳ, which we know is wrong, and positive labels for all the rest. This is not mathematically justified, but in practice one could try it. We, however, want to solve the problem more directly.
00:41:14
Again, I won't go into the details, but the derivation is in the same spirit as before. First we define the classification risk — this time it is multi-class, so things are a bit more involved. Originally the expectation is taken over p(x, y), that is, over ordinarily labeled data; now we want to replace it with an expectation over p̄(x, ȳ), the distribution of complementarily labeled data. Under the assumptions above, the risk can be rewritten in this form, and then, simply by replacing the expectations with sample averages, we can perform empirical risk minimization. In our earlier paper we needed particular choices of the model and the loss; in the follow-up work we no longer need such restrictions and have a general solution — the expression is a bit more complicated, but it works for any loss function. Again we obtain a one-over-root-n convergence rate, which means that as long as we have enough "wrong" labels the classifier really converges, and I think that is quite nice. If we compare our method with the partial-label method and the multi-label method, ours works quite well.
00:42:39
But this is not the end of the story, because we want to use this idea to make multi-class labeling easier with yes/no questions. Instead of asking the crowdworker to label each image by selecting the correct class from a list of one hundred candidates, we randomly choose one class and simply ask whether it is correct. "Is this the Boston Dynamics robot?" If the answer is yes, we immediately obtain an ordinary label — that should of course be collected. "Is this an iRobot Roomba?" If the answer is no, we obtain a complementary label. In previous approaches such complementary answers were regarded as useless and simply discarded, but with our method we can also learn from this complementary information. So now we can use both: we just add the empirical risk for the ordinarily labeled data and the empirical risk for the complementarily labeled data, and then both kinds of information are used — and this really improves the performance. Ordinary classification uses only the ordinary labels; complementary classification uses only the complementary labels; each of them alone is rather weak, but if we use both of them the performance systematically improves. This makes sense: the complementary data also contains information, which was simply thrown away in the ordinary setting; now we also extract information from the complementary labels, and the performance improves systematically. I find this a quite interesting result.
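A toy sketch of the yes/no labeling protocol just described (the helper `ask_worker` is hypothetical and stands in for the actual crowdsourcing query): each query yields either an ordinary label or a complementary label, and the two kinds of data later feed their own empirical risk terms, which are simply summed during training.

```python
import random

NUM_CLASSES = 100

def collect_label(x, ask_worker):
    """Ask one yes/no question about a randomly proposed class.

    ask_worker(x, c) -> bool is a hypothetical crowdsourcing call answering
    "does x belong to class c?".
    """
    c = random.randrange(NUM_CLASSES)      # propose one of the classes at random
    if ask_worker(x, c):
        return ("ordinary", c)             # rare: the random guess was correct
    return ("complementary", c)            # common: we only learn "x is not class c"
```

With one hundred classes, roughly 99% of the answers are "no", so most of the collected data is complementary; the point of the method is that those answers are not discarded but contribute their own unbiased risk estimate alongside the ordinary-label risk.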
00:44:22
OK, that was a quick tour of our framework for learning from weak supervision — I went through several different methods rather quickly. The point is that in all of these methods we are just estimating the classification risk. That means we do not have to pick only one of them: if you have PU data, Pconf data, SU data, complementary data — whatever weak data is available — we can use all of it, combine everything in the risk estimate, and then every single data point can contribute to improving the performance. That is the final picture of our learning-from-weak-supervision framework. We decided to write a book about this; we are still writing it, and it will probably take more than one more year, so I should spend more time on it.
00:45:10
One thing I did not really explain: our methods are model-agnostic and optimization-agnostic, so the framework is completely general. In algorithm development we always use a linear model — it is the simplest, and we can derive generalization error bounds and so on — but when we work with industry partners we always use much more complex models, such as ResNet-style networks, CNNs, RNNs, and so on. The risk correction is completely independent of the model and of the training method. This is actually convenient: new models are published virtually every day — you would have to check arXiv every day to follow the latest papers — but we can basically ignore that race and focus on developing learning methods; whenever we have an application, we take the latest model and put our method on top of it. In that way we can always benefit from the latest architectures.
00:46:14
Then, in the last couple of minutes, let me go through another idea, on learning with noisy labels — a somewhat newer direction we have been working on recently. Noise robustness is of course really important in practice: we have a lot of sensor noise and human error. Traditionally, people have used, for example, unsupervised outlier detection to get rid of erroneous data, but unsupervised detection is itself unreliable, so it is not that effective. Another traditional tool is robust statistics: instead of the squared loss we use the Huber loss or other robust loss functions. That is the standard robust-statistics approach, but it is not strong enough: if a large fraction of the data points are erroneous, it does not work well. We may also use regularization, or estimate a noise transition matrix and things like that, but these are still not really strong enough in reality. So we wanted a new approach that can overcome these problems, designed for deep learning. One paper we published last year is called co-teaching.
00:47:22
Co-teaching is based on the memorization effect of neural networks, which is a purely deep-learning phenomenon. Many people have reported that, during training, clean data is fitted faster than noisy data. Suppose we have a dataset consisting of clean data points and noisy data points mixed together, and we do minibatch training: in the first half of training the clean data is fitted early, while the noisy data takes more time to fit. So if we look at the model somewhere in the middle of training, we can to some extent separate the noisy data from the clean data. That is the basic idea, and in the co-teaching framework we use two neural networks: each network selects its small-loss instances as presumably clean data and gives them to the other network, and both networks do the same thing in parallel. In this way each network is hopefully trained only on clean data selected by the other, and in the end the performance really improves.
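A minimal sketch of the small-loss exchange inside one minibatch (my own paraphrase of the procedure described above, not the authors' code; `keep_rate` is the fraction of instances treated as clean and would normally be scheduled over epochs):

```python
import torch
import torch.nn.functional as F

def co_teaching_step(net_a, net_b, opt_a, opt_b, x, y, keep_rate=0.5):
    """One co-teaching update: each network picks its small-loss samples,
    and the peer network is trained on that selection."""
    n_keep = max(1, int(keep_rate * x.size(0)))

    with torch.no_grad():
        loss_a = F.cross_entropy(net_a(x), y, reduction="none")
        loss_b = F.cross_entropy(net_b(x), y, reduction="none")
        idx_a = torch.argsort(loss_a)[:n_keep]   # samples net A considers clean
        idx_b = torch.argsort(loss_b)[:n_keep]   # samples net B considers clean

    # Cross-update: A learns from B's selection, B learns from A's selection.
    opt_a.zero_grad()
    F.cross_entropy(net_a(x[idx_b]), y[idx_b]).backward()
    opt_a.step()

    opt_b.zero_grad()
    F.cross_entropy(net_b(x[idx_a]), y[idx_a]).backward()
    opt_b.step()
```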
00:48:26
For some benchmark datasets — multi-class classification problems — we added extreme label noise, around fifty percent, so the labels are extremely noisy, but our method still works really well: the performance hardly degrades. This would be completely impossible with just a robust loss function or something like that; being able to actually remove the noisy data from the training set is essential. More recently we added a disagreement criterion to co-teaching: we only update the networks on instances where the two networks disagree. This keeps the two networks as different as possible, and the performance improves further — that is co-teaching+. Then there is one last topic: the gradient step-back.
00:49:18
I talked about PU learning at the beginning, where we had the negativity problem and used the max-with-zero trick to keep the term non-negative. In reality, I must say, there was another trick behind it. In the minibatch back-propagation iterations, once that term becomes negative, we do not simply ignore that minibatch; rather, we take the gradient and, instead of going down the gradient as usual, we go back — we take a gradient ascent step. The intuition is this: we followed this direction, and we now know that there is a bad local solution in that direction, a degenerate region caused by overfitting; so it is better to step back, take another minibatch, compute the stochastic gradient again — which will point in a slightly different direction — and go that way. In this manner we can avoid that bad region. That was the original idea, used for PU learning, but now we have a kind of generalized framework for it and we can also use it for noisy-label data: we perform gradient ascent to step back and avoid such undesirable solutions. There is no theory for this yet, but empirically it works amazingly well. We are trying to justify the use of this idea, but so far we do not have a theoretical result; hopefully we can provide some justification in the future.
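A rough sketch of the step-back rule in a PU training loop, following the description above (the published algorithm includes additional details, such as a tolerance threshold and a separate step size, that are omitted here; `sigmoid_loss` is the same kind of surrogate as in the earlier sketch):

```python
import torch

def sigmoid_loss(margin):
    # Non-negative surrogate loss: l(z) = sigmoid(-z).
    return torch.sigmoid(-margin)

def pu_training_step(model, optimizer, batch_p, batch_u, prior, loss_fn=sigmoid_loss):
    """One minibatch update with the gradient 'step back' heuristic (a sketch)."""
    scores_p, scores_u = model(batch_p), model(batch_u)
    risk_p = prior * loss_fn(scores_p).mean()
    risk_n = loss_fn(-scores_u).mean() - prior * loss_fn(-scores_p).mean()

    optimizer.zero_grad()
    if risk_n.item() < 0:
        # The negative-part estimate went negative: step *back*, i.e. do
        # gradient ascent on that term to move away from the degenerate region.
        (-risk_n).backward()
    else:
        (risk_p + risk_n).backward()
    optimizer.step()
```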
00:50:51
OK, summary. I talked about two issues today. The first was weakly-supervised classification. Learning from small data needs domain knowledge, and of course using domain knowledge is very important, but we tried to go in a different direction: improving performance using only statistical information. The idea is to use a large amount of weak data in an unbiased manner, through a systematic empirical risk minimization approach, and the framework is model-agnostic and optimization-algorithm-agnostic. It is really a quite standard framework, so you can still plug in any domain knowledge or any heuristics used in your own domain. In the last part I talked about robustness to label noise: we tried to go beyond the traditional approaches, such as noise transition matrices and robust loss functions, by looking at the deep learning training process itself and using the memorization effect of neural networks. We proposed the gradient step-back and the use of disagreement; these are still quite informal ideas at the moment, but the performance really seems to improve.
00:52:04
A few more slides. As I said at the beginning, I come from RIKEN, from the RIKEN Center for Advanced Intelligence Project, which is a national project in Japan. We are fully supported by the Ministry of Education, so our activity is quite academic and basic. The center started in 2016 as a ten-year project, so we have about seven years to go. We are interested in the fundamentals of AI — machine learning, optimization, applied mathematics — and we have produced a number of fundamental results so far. At the same time we are also interested in applications, but we are basically a theory-oriented, basic-research center, so we do not have wet-lab people ourselves. So we decided to find partners outside: the National Cancer Center, the biggest cancer research center and cancer hospital in Japan, is one of our partners; the National Institute for Materials Science is another; stem-cell research is also very active in Japan — there is a Nobel prize winner in that field — and we have such partners as well. We are really supporting these science projects.
00:53:15
We are also interested in using AI for social good. For example, natural disaster resilience is a big problem in Japan, and we are really looking into this kind of problem. Management of infrastructure is also a big issue: we have a lot of bridges and tunnels in Japan, many of them are now quite old — fifty years or so — and some are becoming dangerous, so we try to help with their management. Landslides are another example, because we have a lot of heavy rain. Agriculture is another area, as the population of farmers is aging rapidly in Japan. There are strong simulation groups using supercomputers, but we are AI people, and we can contribute by providing data-driven prediction and decision-making technology to these domains. We also have social-science people working on AI in society — ethics guidelines, personal data management, legal issues — so we cover quite a broad range of research. In total we now have more than seven hundred researchers in our center, including about one hundred and fifty full-time people, and our office is located right in the center of Tokyo, within walking distance of the central station. We also have a large number of industry partners — more than forty joint projects now — and together with companies such as Fujitsu, NEC, and Toshiba we have collaboration rooms inside our center, so their researchers work in our center and we work together. We are quite open to industrial collaborations like this, and we are also very happy to have more international collaborations. Of course, having visiting researchers and internship students is always very nice, so if you have students who are interested in coming to Japan and working with us for, say, three months or so, we are always happy to host them.

Recent advances in weakly-supervised learning and reliable learning
Prof. Masashi Sugiyama
May 28, 2019 · 11:04 a.m.