Transcriptions

Note: this content has been automatically generated.
00:00:00
Thank you for the kind introduction. Is my microphone on? OK. I'm really excited to be here today, and my talk will be about recent advances in weakly-supervised learning and robustness. Before starting, let me briefly introduce myself. I'm the director of the RIKEN Center for Advanced Intelligence Project, a national AI research project in Japan that started in 2016. We work on the fundamentals of AI, goal-oriented AI, and AI in society; it is actually quite a big center, with more than three hundred people. At the same time I'm a professor at the University of Tokyo, teaching computer science and supervising students — my lab has grown quite big and I now have about fifty students. I also do some part-time consulting for several local start-ups, for example on real estate and advertisement, and I enjoy doing these kinds of things.
00:01:04
OK, then, about my talk today. Machine learning from big labeled data is really successful, for example in speech recognition, image understanding, machine translation, recommendation, and so on. But there are still various applications and domains where massive labeled data is simply not available. For example, we are interested in medical research, disaster resilience, robotics, brain-signal analysis, and similar areas. In those applications a human is always involved in the data-collection process, and we cannot obtain a huge number of data points. The target of my talk is that kind of setting: we only have limited information, but we still want to learn. My main point is that this is not about classification from small data — learning from small data is generally not possible statistically, unless we have strong prior knowledge. Instead, we consider weak data, but we assume that we have many of them. So n, the number of data points, should still be large; the question is what those data points are — unlabeled data, or maybe some other kind of weak data. That is the main point today.
00:02:19
The target problem is the simplest one: binary supervised classification. We have blue points and red points, and we just want to separate them. We know that a large amount of labeled data gives better classification accuracy: the misclassification error of the learned classifier decreases in order one over the square root of n, where n is the number of labeled data points. That is the standard result.
00:02:44
But, as I said, gathering labeled data is sometimes costly or even impossible. The complete opposite is unsupervised learning: learning from unlabeled data only. We only have points like this, not labeled yet, and hopefully we can collect such unlabeled data quite cheaply. Typically, unsupervised classification means clustering: we separate the points into one cluster for one class and another cluster for the other class. Unsupervised clustering can work as well as supervised classification if one cluster corresponds to one class — say the top-left cluster is all blue and the bottom-right cluster is all red — in which case clustering performs comparably to a supervised method. But in reality it is usually not that simple, and the clustering result may not be really useful; this was just for illustration purposes.
00:03:39
People have studied semi-supervised classification for many years: the idea is to utilize a large number of unlabeled data points in addition to a small number of labeled samples. But the way the problem is solved is still somewhat similar to unsupervised classification: basically some kind of clustering is done, either implicitly or explicitly, except that now we have additional labeled points, so we can propagate this blue label over this cluster and this red label over that cluster. Again we have the same problem: if one cluster corresponds to one class it works well, but otherwise it doesn't. This kind of method works really well on datasets like MNIST, but on harder datasets it often doesn't work so well.
00:04:24
The summary so far is like this: I put classification accuracy on the horizontal axis and labeling cost on the vertical axis. Supervised learning is the most accurate, but its labeling cost is quite high, so it sits in the top-right corner. Unsupervised classification has minimal labeling cost, but its accuracy is usually low, so it sits in the bottom-left corner. Semi-supervised classification is located somewhere in between; we would like it to be much better, but in reality it doesn't work so well. So there is an open space here, and that is our target: classifiers with high accuracy and low labeling cost. That is the goal today, and we have been working on this topic for the last five years or so; hopefully I can give you some ideas today.
00:05:13
This is the list of methods I want to introduce today. We have a lot of acronyms, P, N, U and so on: P is positive, N is negative, U is unlabeled, "conf" is confidence, "S" is similar, and "comp" is complementary. We have many kinds of weak data, and we want to learn from such weak data. If you have any questions, please just interrupt me and ask.
00:05:39
Let me start with PU classification: learning from positive and unlabeled data. The problem is the same — I want to separate blue points and red points — but in this scenario we only have blue points (positive data) and black points (unlabeled data); we don't have a single negative point. Still, the goal is to separate positive from negative. This kind of setting appears, for example, in click prediction: we want to predict whether a user will click a given link or not. Positive data can be collected easily by looking at the history of clicks — those are all positive. But non-clicked links are quite difficult to handle: we might want to regard them as negative data, but that is not necessarily true, because sometimes the user likes those links and simply didn't have time to click them. So non-clicked links are really unlabeled data: they can be either positive or negative. In that kind of scenario we naturally have only positive and unlabeled points, but we still want to separate positive from negative — that is the target problem here.
00:06:54
To solve this problem, we define the risk of a classifier in the standard way. We have a loss function ℓ — it can be the hinge loss, the squared loss, or whatever — measuring the point-wise loss, and this loss is expected over data points drawn from p(x, y). So far this is completely standard; no question here. We typically decompose this risk into a positive part and a negative part. In the standard supervised setup we have both positive and negative data, so the risk for positive data is estimated from positive samples and the risk for negative data is estimated from negative samples — that is ordinary empirical risk minimization. But in the PU setup we have no negative data, so we cannot directly estimate the second term; that is the challenge we need to overcome. One thing I have to explain here is π: it is the class prior, the probability of class +1. It is unknown in reality and has to be estimated from data, but there are already several nice methods for this that work only from positive and unlabeled data. To keep the story simple today, I assume that π is known; in reality, given PU data, we would first estimate π and plug the estimate in. So now let us deal with the second term, the risk for negative data.
00:08:30
We have a very simple trick to estimate that second term, because we know that unlabeled data is a mixture of positive and negative data. Suppose the unlabeled data density is this black curve, the positive class-conditional density is this one and the negative one is this other one, mixed with class proportions π and 1 − π. So we have this kind of mixture equation. Now, the unlabeled data are samples from the marginal density and the positive data are samples from the positive class-conditional density. Then it is straightforward to see that the negative term can be estimated by moving the first term to the left-hand side. That is the only trick we use here.
00:09:16
More specifically, given the risk, we decompose it into the positive part and the negative part. The expectation over negative data cannot be handled directly, but based on the mixture equation from the previous slide we simply replace that part with these two terms. Then we obtain an expression of the risk in which the first term is the positive part, and the second and third terms together correspond to the risk for negative data. If you look at this expression, we only have expectations over positive points (here and here) and an expectation over unlabeled points. So we can simply apply empirical risk minimization using only positive and unlabeled data. That is the story of PU learning so far; basically we are done — we have solved the problem.
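To make the rewriting just described concrete, here is the derivation written out in one common notation (a sketch; π denotes the class prior p(y = +1), ℓ a non-negative loss, and p_P, p_N, p_U the positive, negative, and marginal input densities):

```latex
% Classification risk, split into positive and negative parts
R(f) = \pi\,\mathbb{E}_{p_{\mathrm P}}\bigl[\ell(f(x))\bigr]
     + (1-\pi)\,\mathbb{E}_{p_{\mathrm N}}\bigl[\ell(-f(x))\bigr]

% Unlabeled data is a mixture of the two classes
p_{\mathrm U}(x) = \pi\,p_{\mathrm P}(x) + (1-\pi)\,p_{\mathrm N}(x)
\;\Longrightarrow\;
(1-\pi)\,\mathbb{E}_{p_{\mathrm N}}\bigl[\ell(-f(x))\bigr]
  = \mathbb{E}_{p_{\mathrm U}}\bigl[\ell(-f(x))\bigr]
  - \pi\,\mathbb{E}_{p_{\mathrm P}}\bigl[\ell(-f(x))\bigr]

% Substituting gives a risk involving only P and U expectations
R(f) = \pi\,\mathbb{E}_{p_{\mathrm P}}\bigl[\ell(f(x))\bigr]
     - \pi\,\mathbb{E}_{p_{\mathrm P}}\bigl[\ell(-f(x))\bigr]
     + \mathbb{E}_{p_{\mathrm U}}\bigl[\ell(-f(x))\bigr]
```

Each expectation can then be replaced by the corresponding sample average over the positive or unlabeled set, which is the empirical risk minimized in practice.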
00:10:08
(Question from the audience about the assumptions on the unlabeled data and the model.) Yes, it can be anything — we just assume that the unlabeled points are drawn i.i.d. from the marginal distribution. We don't need anything more than that: we just need samples, and each expectation is replaced by a sample average, without assuming any parametric form behind it; the estimator is just the sample mean. So, to be precise, I do assume that the samples are i.i.d. from the corresponding distributions, I agree, but I put no assumption on the model f — I will say more about the model a bit later, but it can be anything: a linear model, a CNN, an LSTM, whatever model you like.
00:11:04
then suppose model is i mean yeah funk some minimal though then we can have a simple like estimation error problems like it's
00:11:11
so i don't really going to the detail but the important point is we have one low bass scared scared of and be here
00:11:18
and one of oscar the and you hear so and b. and and and you are the number of data points from the and you
00:11:24
so then so we know that one the most care route that scare them and is this really the best although
00:11:30
we cannot see so then this isn't isn't it opens
00:11:32
already means that this simple and bigger risk madison approaches optimal
00:11:39
and once we perform the same analysis for the in learning
00:11:43
standards about was lining then we have a similar problem like this
00:11:47
and again we have one of moscow them and be here and one of us go with them and and here again it is optimal
00:11:54
pencil p. you're running and also p. enlightening achieve the of michael wasn't straight from this analysis
00:12:00
but now do we can compare these two bonds because i do we have the common concerns here
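Schematically, the two bounds being compared have the following shape (only the shape the speaker describes; constants and the dependence on the class prior are omitted here):

```latex
% PU learning: estimation error controlled by the P- and U-sample sizes
R(\hat f_{\mathrm{PU}}) - R(f^{*}) \;\le\; C\left(\frac{1}{\sqrt{n_{\mathrm P}}} + \frac{1}{\sqrt{n_{\mathrm U}}}\right)

% PN learning: estimation error controlled by the P- and N-sample sizes
R(\hat f_{\mathrm{PN}}) - R(f^{*}) \;\le\; C\left(\frac{1}{\sqrt{n_{\mathrm P}}} + \frac{1}{\sqrt{n_{\mathrm N}}}\right)
```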
00:12:06
then so this idea quite interesting because we started this spiel on results other kind of compromise
00:12:12
because we we can't collect native data for is that that is unable data so that was a compromise
00:12:18
but by comparing these two buttons do we can so that under some condition
00:12:23
you running can be better than p. and on this uh this p. line can be better than started standards of about one
00:12:30
the uh the condition is really like this i just compare these two pounds
00:12:35
if you have a last number because the points then we've done side tends to be small and
00:12:40
if you have a large number and a bit are also the fun side than to be small
00:12:44
then if we have only a small number of the of the points then right on site and to become a lot
00:12:50
so that means if you only have a small number of native data then is that the relying on these little
00:12:56
small number one maybe a a small number of nudity points using allows normal and they will point is actually better
00:13:02
so this is the kinds that suited for finding and i i think it's quite
00:13:05
interesting and contacted and i will do so later about this really happens in reality
00:13:12
OK, so we have kind of solved the PU learning problem. Unfortunately, this naive method does not really work well, and I would like to explain why. What we have done so far is to decompose the risk into a positive part and a negative part, and the negative part was further decomposed into these two terms using the mixture equation. The problem is this minus term. Suppose the loss function is non-negative; then by definition the risk for positive data is non-negative and the risk for negative data is also non-negative. But the second term is now decomposed into two terms with a minus sign between them. By definition their combination is still non-negative — something like six minus two. In reality, however, we estimate each term by a sample average: the six is estimated from data and the two is also estimated from data, and sometimes it comes out as, say, two minus four, which is negative. This is quite problematic, and we found that this negativity problem really does happen when we use a flexible model such as a deep neural network. Let's see a numerical example.
00:14:30
We used a standard setup on MNIST and trained with back-propagation. If we use positive and negative data in the standard way, the training error drops like this dotted line, stays non-negative, and converges nicely; the test error also decreases nicely and converges to a reasonable value. But with PU learning the training error drops like this, at some point it becomes negative, and then it really goes down. Interestingly, if you look at the test error, it starts to increase. This is understandable: the training error should be non-negative, but because of that negative term it takes a large negative value, and this is a clear indication of overfitting. We wanted to avoid this issue, so we decided to constrain the sample approximation of that term to be non-negative: this part is non-negative in expectation, so we just take the maximum with zero — whenever the sum of those two terms becomes negative, we simply round it up to zero.
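As an illustration of the correction just described, here is a minimal sketch of the non-negative PU risk for one minibatch in PyTorch. This is my own paraphrase, not the authors' reference implementation; `sigmoid_loss` is just one possible non-negative surrogate loss, and `prior` is the class prior π, assumed known or pre-estimated.

```python
import torch

def sigmoid_loss(margin):
    # A smooth, non-negative surrogate: l(z) = sigmoid(-z).
    return torch.sigmoid(-margin)

def nn_pu_risk(scores_p, scores_u, prior, loss_fn=sigmoid_loss):
    """Non-negative PU risk for one minibatch (a sketch).

    scores_p: classifier outputs f(x) on positive examples
    scores_u: classifier outputs f(x) on unlabeled examples
    prior:    class prior pi = p(y = +1)
    """
    # Positive part: pi * E_P[ l(f(x)) ]
    risk_p = prior * loss_fn(scores_p).mean()
    # Estimated negative part: E_U[ l(-f(x)) ] - pi * E_P[ l(-f(x)) ]
    risk_n = loss_fn(-scores_u).mean() - prior * loss_fn(-scores_p).mean()
    # The correction: round the (possibly negative) estimate up to zero.
    return risk_p + torch.clamp(risk_n, min=0.0)
```

Without the final `clamp`, this is the unbiased estimator from the earlier slides; with a flexible model, `risk_n` is exactly the quantity that can be driven below zero.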
00:15:40
(Question from the audience about which terms are constrained.) Just to be clear: the first term, the risk for positive data, is kept as it is. The second and third terms together form the risk for negative data — maybe I didn't define it carefully, but those two terms together are what gets replaced by sample averages; this is just a shorthand notation. When that estimated negative-class term becomes negative, we round it up to zero. (Do you do this per minibatch?) Yes, we do this for each minibatch.
00:16:45
Originally we justified our estimator by the fact that it is unbiased, and we had the bound I showed; but because of this max-with-zero trick it becomes biased, so we need to re-justify the estimator somehow. I won't go into the details, but we have some nice theoretical results: the corrected estimator is consistent, its bias decreases exponentially fast, its mean squared error is smaller, and for nonlinear models it still achieves the optimal convergence rate. In that sense we can justify the use of this max-with-zero trick.
00:17:22
Let's see some experimental results — the correction really improves things dramatically. Here we use a binarized version of the CIFAR-10 dataset. For PN learning, shown in blue, the training error decreases and converges to zero nicely, and the test error also drops and converges. For PU learning without any modification, the training error again goes negative like this, and then the test error increases, so it doesn't work. But with the max-with-zero trick the training error decreases and stays non-negative, the test error also keeps decreasing, and in the end this value is actually smaller than the test error of PN learning. The setting is a bit artificial, because we chose the numbers of data points so that the inequality from the earlier analysis is satisfied: we have only a small number of negative data and a large number of unlabeled data. But clearly, in that situation, PU learning can outperform PN learning — standard supervised learning — and that is an interesting finding.
00:18:38
OK, here is a summary of PU learning. It may look a bit complicated, but in the end what we are doing is quite simple: we are basically separating the blue points from the black points. The black unlabeled points are a mixture of red and blue points, so we regard the unlabeled data as noisy negative data and separate blue from black. Doing this naively is biased — the resulting boundary sits in the wrong place, so it doesn't work. To remove the bias we change the loss function: as I showed, the risk expression can be viewed as using a composite loss obtained by combining two losses. Essentially we define a modified loss and use it for the positive data (the blue points), while the unlabeled data just gets the ordinary loss for the negative class. With these modified loss functions we can systematically eliminate the bias caused by treating unlabeled data as negative, and we can establish the optimal convergence rate. If the loss function satisfies a certain symmetry condition — satisfied, for example, by the ramp loss — then this essentially reduces to using the same loss for the positive and unlabeled data. If the composite loss is a linear function, the optimization problem stays convex; this is satisfied by the squared loss, the logistic loss, and the double-hinge loss, so we often use the double hinge. These loss functions look like the curves shown here. Finally, for deep networks, we have the non-negative correction: rounding the empirical risk for the negative class up to zero really improves the performance. That was the summary of PU classification — do you have any questions?
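In symbols, the composite-loss view of PU learning mentioned in this summary looks roughly as follows (a sketch; scaling conventions may differ from the papers):

```latex
% Grouping the PU risk by data source: the P data gets a composite loss
R_{\mathrm{PU}}(f)
  = \pi\,\mathbb{E}_{p_{\mathrm P}}\bigl[\,\ell(f(x)) - \ell(-f(x))\,\bigr]
  + \mathbb{E}_{p_{\mathrm U}}\bigl[\ell(-f(x))\bigr]

% If l(z) - l(-z) is linear in z (e.g. equal to -z), the objective stays convex
% for a convex l and a linear-in-parameters model; this holds for the squared,
% logistic and double-hinge losses.  The ramp loss instead satisfies the
% symmetry l(z) + l(-z) = 1.
```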
00:20:38
(Question from the audience, not fully audible.) In the standard supervised case we just estimate this expectation by a sample average and that expectation by a sample average, and then non-negativity is preserved automatically, so we simply apply standard empirical risk minimization. But in our case the second term is decomposed into something like a minus b, and because of that minus sign strange things can happen. (Follow-up question.) Yes — in expectation the sum of those two terms is still non-negative, and with infinitely many data points it would stay non-negative, but because of the finite-sample effect it can become negative; and if you use a flexible model it almost always does become negative, because essentially the model tries to push that quantity as low as possible. That is the problem, and the max-with-zero trick prevents that effect.
00:22:11
So that was PU learning: we can now solve the classification problem only from positive and unlabeled data. In the second part on weakly-supervised learning we have a number of different methods, which are all variations of PU learning. Let's start with PNU learning; this one is simple. PNU learning is basically semi-supervised classification: we have positive, negative, and unlabeled data — blue points, red points, and black points. Our idea is to decompose this problem into PU, PN, and NU learning. We have already solved the PU learning problem, and NU learning is essentially the same as PU learning, so we can use the same method; PN learning is just standard supervised learning, so we can use whatever we like. We can solve each of them separately, so why don't we simply add them up? We have learning criteria for PU, PN, and NU learning, and we just combine them. The next small question is how to combine these three criteria. We could introduce two hyperparameters and tune them by cross-validation — that would be fine — but we wanted to keep things as simple as possible, so we decided to choose two out of the three.
00:23:34
Naively, I first thought we should combine PU learning and NU learning, because they are symmetric. But we found, based on the same kind of risk analysis, that combining PU and NU is actually not the best choice. I won't go into the details, but we can show that if PU learning is better than NU learning we have one ordering, and otherwise the other; and if we want to choose the best two out of three, then the PU+NU pair is not the best — combining PU with PN, or NU with PN, is the best combination, at least according to the risk analysis, and we also confirmed this quite clearly in experiments.
00:24:17
OK, this may sound a bit complicated, but in the end what we do is quite straightforward. We have PU learning, PN learning, and NU learning, and a single hyperparameter η. If η is minus one, we just use PU learning; as η increases, PN learning gains more influence; if η is zero we use only PN learning — we don't use the unlabeled data at all; and as η increases further towards one, it moves to NU learning. So we simply control the balance with this single hyperparameter η, which is chosen systematically by cross-validation.
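One way to write the single-hyperparameter combination just described (my paraphrase of the scheme; the exact parameterization in the paper may differ):

```latex
R_{\eta}(f) =
\begin{cases}
(1+\eta)\,R_{\mathrm{PN}}(f) \;+\; (-\eta)\,R_{\mathrm{PU}}(f), & -1 \le \eta \le 0,\\[4pt]
(1-\eta)\,R_{\mathrm{PN}}(f) \;+\; \eta\,R_{\mathrm{NU}}(f), & \phantom{-}0 \le \eta \le 1,
\end{cases}
```

so that η = −1 recovers pure PU learning, η = 0 pure PN learning, and η = 1 pure NU learning, with η selected by cross-validation.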
00:24:58
That is the final method. Then — and I find this quite interesting — we can upper-bound the generalization error in this way. The important part is this last term, which is of order one over the square root of n_U, where n_U is the number of unlabeled data points. This shows that the generalization error decreases as we increase the number of unlabeled data points. In this derivation we made no assumption on clusters or anything like that; we only assume that the data points are drawn i.i.d. from the underlying distributions — the standard setup. Still, we can guarantee that the generalization error decreases as we get more unlabeled data. To the best of our knowledge, this is the first result that obtains this kind of error bound without an additional assumption such as the cluster assumption.
00:25:51
More technically, the way we use the unlabeled data is quite different from previous semi-supervised learning methods. Previous methods basically use unlabeled data for regularization — for label propagation, say, where labels are propagated over the clusters — and that is essentially a way of reducing the variance of the solution. In our case the method is based on PU and NU learning: in PU learning the unlabeled data is regarded as noisy negative data, and in NU learning it is regarded as noisy positive data, and the unlabeled data is used for evaluating the loss itself, which means it is used for reducing the bias. So the role of the unlabeled data is quite different: we are essentially extracting label information from the unlabeled data, and that is why we can achieve the optimal convergence rate even without a cluster assumption. Then we did some experiments.
00:26:58
Of course, it is not that easy to claim that a new method works really well, but at least, because of this bound, we can guarantee that it is safe. If you use a standard semi-supervised classification method and the cluster assumption is not really satisfied, the performance can be even worse than plain supervised learning from the small number of labeled points. That did not happen in our experiments. Compared with well-known previous methods — which sometimes work quite well but whose performance is rather unstable — our method is quite stable. I cannot simply say that ours is always the best, but at least it is safe and stable; that is the take-away here. OK, that was PNU classification. Then let's move on to the next topic... hmm, did something go wrong? Very strange... I think the slides are frozen.
00:28:37
OK, the next one is Pconf classification. In PU learning we assumed that unlabeled data is available, but sometimes even unlabeled data is not available. For example, if we work for some company and try to analyze its customer data, we are typically not allowed to access the databases of other companies because of privacy issues. In that case we can only access the database of our own company: only positive data, collected from a single class. Learning from positive data alone is, in general, not possible — it is basically the same as one-class classification, an unsupervised problem — and we cannot really train a classifier systematically. But we found that if the positive data is equipped with confidence — for each positive sample we are given how likely it is to be positive, its class-posterior probability — then we can actually solve the problem. This is called Pconf classification: learning from positive-confidence data.
00:29:37
The derivation follows the same spirit as before: we first define the classification risk as usual. Naively, we might think that if a point is 80% positive then it is 20% negative, so we could treat it as 80% of a positive example and 20% of a negative example — the standard way of weighting data points. But we can show that this is not the right thing to do: it does not agree with the true risk, because our data comes only from the positive class. The correct expression looks somewhat similar, but it is like this, involving the ratio of one minus the confidence to the confidence; we only need a few lines to derive it. Then we can again use the empirical risk minimization approach to obtain a training criterion. So the problem can be solved, and one nice thing is that the class prior π appears only as a proportional constant, so we can simply drop it — we do not have to estimate π in this scenario, which keeps things simple. We can again show a one-over-root-n convergence rate, so the method is optimal in that sense. We also did some experiments; they are a bit difficult to design because there is no clear baseline to compare with. If you compare with PN learning, then of course PN learning is the best — it works like this — but if you compare our method with the naive weighting method that I just argued is wrong, then our method is better.
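For reference, the two objectives being contrasted here have roughly the following shape (a sketch based on the description above; r(x) = p(y = +1 | x) is the given confidence and π the class prior, which enters only as a constant factor):

```latex
% Naive confidence weighting (does not agree with the true risk):
R_{\mathrm{naive}}(f) = \mathbb{E}_{p_{\mathrm P}}\bigl[\,r(x)\,\ell(f(x)) + (1-r(x))\,\ell(-f(x))\,\bigr]

% Pconf objective: an expectation over positive data only, with an odds-ratio weight
R(f) \;\propto\; \mathbb{E}_{p_{\mathrm P}}\Bigl[\,\ell(f(x)) + \tfrac{1-r(x)}{r(x)}\,\ell(-f(x))\,\Bigr]
```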
00:31:25
(Question from the audience about the assumptions on the confidence.) Right — we assume that the confidence r(x) is non-zero, which means that the support of the positive class covers the entire domain, because what we are doing is essentially an importance-weighting-style correction. (Follow-up about very small confidence values.) Right — in the experiments we also encountered that issue: when the given confidence is very low it causes some numerical instability, so we made a small modification and clipped very small confidence values to a slightly larger constant, and then the behavior was stable. In practice, how to define the confidence is perhaps another issue. For example, with crowdsourcing we can ask five crowdworkers to label the same sample; if one out of five says yes, or four out of five say yes, we can take 20% or 80% as the confidence, or something like that. That is one way, but the question is still open; depending on the use case we need to define the confidence in the right way, and that has not really been explored yet. OK, that was Pconf classification.
00:33:06
Next, let's go even further with weak supervision: UU classification, learning from two sets of unlabeled data. Consider a completely unsupervised scenario, but suppose we have two unlabeled datasets, this one and this one. Our assumption is that each dataset contains both positive and negative examples mixed together, but the class priors — the class proportions — are different: for the left dataset maybe 50% positive and 50% negative, and for the other one maybe 70% positive and 30% negative. We assume the two datasets have different class priors, but we do not even have to know the priors; we only assume that they differ. This assumption is necessary: if you take a single unlabeled dataset and just split it into two subsets, they have the same class prior, and then this does not work. Then we can show that from these two unlabeled datasets alone we can learn the decision boundary. We cannot know which side is positive and which is negative — we have no labeled data at all — but we can still draw the boundary. The trick is again related to PU learning: as I said at the beginning, in PU learning we regard the unlabeled data as noisy negative data. In UU learning both sets are unlabeled, and we regard one as noisy positive data and the other as noisy negative data and simply separate the two sets. Then we can correct the bias completely and obtain an unbiased solution, and again we can achieve the one-over-root-n convergence rate, only from two sets of unlabeled data. This may sound a bit sensational: it is completely unsupervised, yet we can train a classifier without using a single label. I am thinking of using this kind of technique, for example, for collecting data from two different hospitals. Suppose we want to predict cancer or not: we do not really need any labels — we just collect patient information from the two hospitals and assume that the proportions of cancer patients differ between them; then, without a single label, we can in principle train the classifier. So those are the results here.
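A quick sketch of why two unlabeled sets with different class priors carry label information: if the two priors θ and θ′ were known (the method described above only needs them to be different, not known), the class-conditional densities could in principle be recovered from the two mixtures by solving a 2×2 linear system:

```latex
p_{1}(x) = \theta\,p_{\mathrm P}(x) + (1-\theta)\,p_{\mathrm N}(x), \qquad
p_{2}(x) = \theta'\,p_{\mathrm P}(x) + (1-\theta')\,p_{\mathrm N}(x), \qquad \theta \neq \theta'

p_{\mathrm P}(x) = \frac{(1-\theta')\,p_{1}(x) - (1-\theta)\,p_{2}(x)}{\theta - \theta'}, \qquad
p_{\mathrm N}(x) = \frac{\theta'\,p_{1}(x) - \theta\,p_{2}(x)}{\theta' - \theta}
```

In practice the method works at the level of risks rather than densities: one set is treated as noisy positive data and the other as noisy negative data, and the resulting bias is corrected, in the same spirit as PU learning.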
00:35:55
(Question from the audience about whether this applies to image data.) I am not directly working on image data myself here, but as long as the data can be regarded as i.i.d. samples from the two distributions, it should work — I think it would be very nice to try.
00:36:21
Here is a small variation, which we call SU classification: learning from similar and unlabeled data. Consider a sensitive classification problem — predicting income level, religion, political opinion, things like that. Sometimes it is not easy for people to answer directly "my opinion is yes" or "no"; collecting explicit labels is quite difficult. But sometimes it is easy to say "I have the same opinion as that person", or "these two people have the same opinion". So we do not have to reveal yes or no; we only collect pairs with the same opinion: these two are similar, those two are similar. This looks like constrained clustering with must-link constraints, but constrained clustering is unsupervised learning, whereas this is actually supervised classification: we can show that, from pairwise similar data and unlabeled data points alone, we can draw a decision boundary. Again, since we have no labeled data, we cannot know which side is positive and which is negative, but at least we can draw the line, and again with the optimal one-over-root-n rate. The trick is quite simple: pairs of similar points can essentially be decomposed into unlabeled data, so we effectively end up with two sets of unlabeled data and can apply the same technique as before. That is how we solve the problem. We can also extend this to use dissimilarity information — "these two people have different opinions" — and again, from dissimilar pairs and unlabeled data alone we can train a classifier. So it is completely unsupervised in terms of explicit labels, but we can still solve the problem. The next topic is complementary-label classification.
00:38:10
This one is slightly different from the previous stories: it is not a binary problem but a multi-class problem with more than two classes. For example, "what is the robot in this image?", and we have a lot of candidate classes, say one hundred. If we use crowdsourcing to label this kind of data, the crowdworker has to go through the labels from 1 to 100: "OK, this is number 83, the Boston Dynamics robot", or something like that. This labeling process is really time-consuming, because selecting the correct class from a long list of candidates is quite painful. So we decided to use what we call complementary labels — a kind of "wrong" label. We collect pairs (x, ȳ), where ȳ is a class that the sample does not belong to: this image is not a cat, this one is not a dog, this one is not class one, and so on. It is quite easy to collect that kind of complementary information, because if we randomly pick one of the classes it is almost always wrong: with one hundred classes in a single-label (not multi-label) problem, only one class is correct and ninety-nine are wrong, so a randomly picked class is almost always wrong. The crowdworker then just has to say "wrong, wrong, wrong" and we can collect the data easily. And we can show that, from complementarily labeled data alone, we can train a classifier. The assumption is that the complementary labels are uniformly distributed over the wrong classes: there are ninety-nine wrong classes and we assume a uniform distribution over those ninety-nine. A biased distribution can also be handled if you have some prior knowledge, but for simplicity we assume a uniform distribution here.
00:40:10
For this complementary-label problem we can think of two baseline approaches. One is a method called classification from partial labels — actually a quite nice method — where multiple candidate classes are provided for each sample: "this sample belongs to either class one or class two", one of which is correct and the others wrong. Complementary classification can be regarded as an extreme case of this partial-label scenario, where all of the classes other than ȳ are given as candidates. The second approach is not really correct, but we could just use a multi-label method: each sample is allowed to belong to multiple classes, and we give a negative label for ȳ, which we know is wrong, and positive labels for all the rest. This is not mathematically justified, but in practice one could try it. We, however, want to solve the problem more directly.
00:41:14
Again, I won't go into the details, but the derivation is in the same spirit as before. First we define the classification risk — this time it is multi-class, so things are a bit more involved. Originally the expectation is taken over p(x, y), that is, over ordinarily labeled data; now we want to replace it with an expectation over p̄(x, ȳ), the distribution of complementarily labeled data. Under the assumptions above, the risk can be rewritten in this form, and then, simply by replacing the expectations with sample averages, we can perform empirical risk minimization. In our earlier paper we needed particular choices of the model and the loss; in the follow-up work we no longer need such restrictions and have a general solution — the expression is a bit more complicated, but it works for any loss function. Again we obtain a one-over-root-n convergence rate, which means that as long as we have enough "wrong" labels the classifier really converges, and I think that is quite nice. If we compare our method with the partial-label method and the multi-label method, ours works quite well.
00:42:39
But this is not the end of the story, because we want to use this idea to make multi-class labeling easier with yes/no questions. Instead of asking the crowdworker to label each image by selecting the correct class from a list of one hundred candidates, we randomly choose one class and simply ask whether it is correct. "Is this the Boston Dynamics robot?" If the answer is yes, we immediately obtain an ordinary label — that should of course be collected. "Is this an iRobot Roomba?" If the answer is no, we obtain a complementary label. In previous approaches such complementary answers were regarded as useless and simply discarded, but with our method we can also learn from this complementary information. So now we can use both: we just add the empirical risk for the ordinarily labeled data and the empirical risk for the complementarily labeled data, and then both kinds of information are used — and this really improves the performance. Ordinary classification uses only the ordinary labels; complementary classification uses only the complementary labels; each of them alone is rather weak, but if we use both of them the performance systematically improves. This makes sense: the complementary data also contains information, which was simply thrown away in the ordinary setting; now we also extract information from the complementary labels, and the performance improves systematically. I find this a quite interesting result.
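A toy sketch of the yes/no labeling protocol just described (the helper `ask_worker` is hypothetical and stands in for the actual crowdsourcing query): each query yields either an ordinary label or a complementary label, and the two kinds of data later feed their own empirical risk terms, which are simply summed during training.

```python
import random

NUM_CLASSES = 100

def collect_label(x, ask_worker):
    """Ask one yes/no question about a randomly proposed class.

    ask_worker(x, c) -> bool is a hypothetical crowdsourcing call answering
    "does x belong to class c?".
    """
    c = random.randrange(NUM_CLASSES)      # propose one of the classes at random
    if ask_worker(x, c):
        return ("ordinary", c)             # rare: the random guess was correct
    return ("complementary", c)            # common: we only learn "x is not class c"
```

With one hundred classes, roughly 99% of the answers are "no", so most of the collected data is complementary; the point of the method is that those answers are not discarded but contribute their own unbiased risk estimate alongside the ordinary-label risk.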
00:44:22
OK, that was a quick tour of our framework for learning from weak supervision — I went through several different methods rather quickly. The point is that in all of these methods we are just estimating the classification risk. That means we do not have to pick only one of them: if you have PU data, Pconf data, SU data, complementary data — whatever weak data is available — we can use all of it, combine everything in the risk estimate, and then every single data point can contribute to improving the performance. That is the final picture of our learning-from-weak-supervision framework. We decided to write a book about this; we are still writing it, and it will probably take more than one more year, so I should spend more time on it.
00:45:10
One thing I did not really explain: our methods are model-agnostic and optimization-agnostic, so the framework is completely general. In algorithm development we always use a linear model — it is the simplest, and we can derive generalization error bounds and so on — but when we work with industry partners we always use much more complex models, such as ResNet-style networks, CNNs, RNNs, and so on. The risk correction is completely independent of the model and of the training method. This is actually convenient: new models are published virtually every day — you would have to check arXiv every day to follow the latest papers — but we can basically ignore that race and focus on developing learning methods; whenever we have an application, we take the latest model and put our method on top of it. In that way we can always benefit from the latest architectures.
00:46:14
Then, in the last couple of minutes, let me go through another idea, on learning with noisy labels — a somewhat newer direction we have been working on recently. Noise robustness is of course really important in practice: we have a lot of sensor noise and human error. Traditionally, people have used, for example, unsupervised outlier detection to get rid of erroneous data, but unsupervised detection is itself unreliable, so it is not that effective. Another traditional tool is robust statistics: instead of the squared loss we use the Huber loss or other robust loss functions. That is the standard robust-statistics approach, but it is not strong enough: if a large fraction of the data points are erroneous, it does not work well. We may also use regularization, or estimate a noise transition matrix and things like that, but these are still not really strong enough in reality. So we wanted a new approach that can overcome these problems, designed for deep learning. One paper we published last year is called co-teaching.
00:47:22
Co-teaching is based on the memorization effect of neural networks, which is a purely deep-learning phenomenon. Many people have reported that, during training, clean data is fitted faster than noisy data. Suppose we have a dataset consisting of clean data points and noisy data points mixed together, and we do minibatch training: in the first half of training the clean data is fitted early, while the noisy data takes more time to fit. So if we look at the model somewhere in the middle of training, we can to some extent separate the noisy data from the clean data. That is the basic idea, and in the co-teaching framework we use two neural networks: each network selects its small-loss instances as presumably clean data and gives them to the other network, and both networks do the same thing in parallel. In this way each network is hopefully trained only on clean data selected by the other, and in the end the performance really improves.
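A minimal sketch of the small-loss exchange inside one minibatch (my own paraphrase of the procedure described above, not the authors' code; `keep_rate` is the fraction of instances treated as clean and would normally be scheduled over epochs):

```python
import torch
import torch.nn.functional as F

def co_teaching_step(net_a, net_b, opt_a, opt_b, x, y, keep_rate=0.5):
    """One co-teaching update: each network picks its small-loss samples,
    and the peer network is trained on that selection."""
    n_keep = max(1, int(keep_rate * x.size(0)))

    with torch.no_grad():
        loss_a = F.cross_entropy(net_a(x), y, reduction="none")
        loss_b = F.cross_entropy(net_b(x), y, reduction="none")
        idx_a = torch.argsort(loss_a)[:n_keep]   # samples net A considers clean
        idx_b = torch.argsort(loss_b)[:n_keep]   # samples net B considers clean

    # Cross-update: A learns from B's selection, B learns from A's selection.
    opt_a.zero_grad()
    F.cross_entropy(net_a(x[idx_b]), y[idx_b]).backward()
    opt_a.step()

    opt_b.zero_grad()
    F.cross_entropy(net_b(x[idx_a]), y[idx_a]).backward()
    opt_b.step()
```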
00:48:26
For some benchmark datasets — multi-class classification problems — we added extreme label noise, around fifty percent, so the labels are extremely noisy, but our method still works really well: the performance hardly degrades. This would be completely impossible with just a robust loss function or something like that; being able to actually remove the noisy data from the training set is essential. More recently we added a disagreement criterion to co-teaching: we only update the networks on instances where the two networks disagree. This keeps the two networks as different as possible, and the performance improves further — that is co-teaching+. Then there is one last topic: the gradient step-back.
00:49:18
I talked about PU learning at the beginning, where we had the negativity problem and used the max-with-zero trick to keep the term non-negative. In reality, I must say, there was another trick behind it. In the minibatch back-propagation iterations, once that term becomes negative, we do not simply ignore that minibatch; rather, we take the gradient and, instead of going down the gradient as usual, we go back — we take a gradient ascent step. The intuition is this: we followed this direction, and we now know that there is a bad local solution in that direction, a degenerate region caused by overfitting; so it is better to step back, take another minibatch, compute the stochastic gradient again — which will point in a slightly different direction — and go that way. In this manner we can avoid that bad region. That was the original idea, used for PU learning, but now we have a kind of generalized framework for it and we can also use it for noisy-label data: we perform gradient ascent to step back and avoid such undesirable solutions. There is no theory for this yet, but empirically it works amazingly well. We are trying to justify the use of this idea, but so far we do not have a theoretical result; hopefully we can provide some justification in the future.
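A rough sketch of the step-back rule in a PU training loop, following the description above (the published algorithm includes additional details, such as a tolerance threshold and a separate step size, that are omitted here; `sigmoid_loss` is the same kind of surrogate as in the earlier sketch):

```python
import torch

def sigmoid_loss(margin):
    # Non-negative surrogate loss: l(z) = sigmoid(-z).
    return torch.sigmoid(-margin)

def pu_training_step(model, optimizer, batch_p, batch_u, prior, loss_fn=sigmoid_loss):
    """One minibatch update with the gradient 'step back' heuristic (a sketch)."""
    scores_p, scores_u = model(batch_p), model(batch_u)
    risk_p = prior * loss_fn(scores_p).mean()
    risk_n = loss_fn(-scores_u).mean() - prior * loss_fn(-scores_p).mean()

    optimizer.zero_grad()
    if risk_n.item() < 0:
        # The negative-part estimate went negative: step *back*, i.e. do
        # gradient ascent on that term to move away from the degenerate region.
        (-risk_n).backward()
    else:
        (risk_p + risk_n).backward()
    optimizer.step()
```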
00:50:51
OK, summary. I talked about two issues today. The first was weakly-supervised classification. Learning from small data needs domain knowledge, and of course using domain knowledge is very important, but we tried to go in a different direction: improving performance using only statistical information. The idea is to use a large amount of weak data in an unbiased manner, through a systematic empirical risk minimization approach, and the framework is model-agnostic and optimization-algorithm-agnostic. It is really a quite standard framework, so you can still plug in any domain knowledge or any heuristics used in your own domain. In the last part I talked about robustness to label noise: we tried to go beyond the traditional approaches, such as noise transition matrices and robust loss functions, by looking at the deep learning training process itself and using the memorization effect of neural networks. We proposed the gradient step-back and the use of disagreement; these are still quite informal ideas at the moment, but the performance really seems to improve.
00:52:04
A few more slides. As I said at the beginning, I come from RIKEN, from the RIKEN Center for Advanced Intelligence Project, which is a national project in Japan. We are fully supported by the Ministry of Education, so our activity is quite academic and basic. The center started in 2016 as a ten-year project, so we have about seven years to go. We are interested in the fundamentals of AI — machine learning, optimization, applied mathematics — and we have produced a number of fundamental results so far. At the same time we are also interested in applications, but we are basically a theory-oriented, basic-research center, so we do not have wet-lab people ourselves. So we decided to find partners outside: the National Cancer Center, the biggest cancer research center and cancer hospital in Japan, is one of our partners; the National Institute for Materials Science is another; stem-cell research is also very active in Japan — there is a Nobel prize winner in that field — and we have such partners as well. We are really supporting these science projects.
00:53:15
We are also interested in using AI for social good. For example, natural disaster resilience is a big problem in Japan, and we are really looking into this kind of problem. Management of infrastructure is also a big issue: we have a lot of bridges and tunnels in Japan, many of them are now quite old — fifty years or so — and some are becoming dangerous, so we try to help with their management. Landslides are another example, because we have a lot of heavy rain. Agriculture is another area, as the population of farmers is aging rapidly in Japan. There are strong simulation groups using supercomputers, but we are AI people, and we can contribute by providing data-driven prediction and decision-making technology to these domains. We also have social-science people working on AI in society — ethics guidelines, personal data management, legal issues — so we cover quite a broad range of research. In total we now have more than seven hundred researchers in our center, including about one hundred and fifty full-time people, and our office is located right in the center of Tokyo, within walking distance of the central station. We also have a large number of industry partners — more than forty joint projects now — and together with companies such as Fujitsu, NEC, and Toshiba we have collaboration rooms inside our center, so their researchers work in our center and we work together. We are quite open to industrial collaborations like this, and we are also very happy to have more international collaborations. Of course, having visiting researchers and internship students is always very nice, so if you have students who are interested in coming to Japan and working with us for, say, three months or so, we are always happy to host them.

Recent advances in weakly-supervised learning and reliable learning
Prof. Masashi Sugiyama
May 28, 2019 · 11:04 a.m.