WEBVTT
00:00:04.089 --> 00:00:15.609
So welcome back to our lecture on Automotive Vision. And let
us continue at the point where we stopped last week. Okay,
00:00:15.609 --> 00:00:24.628
so we are talking about feature point methods, and we are
still in step number one concerning finding salient points in
00:00:24.628 --> 00:00:33.811
an image, finding points that, in general, could be
recognized again in other images. And for that purpose, we
00:00:33.811 --> 00:00:43.939
were looking at the Laplacian of Gaussian filter, and we were
discussing that the Laplacian of Gaussian mask depends on a
00:00:43.939 --> 00:00:53.016
parameter Sigma. And depending on the value of Sigma, the
shape of the filter response changes. So the question still
00:00:53.016 --> 00:01:02.874
is, which value for Sigma to choose in order to get
appropriate results. And one step that we did was that we were
00:01:02.874 --> 00:01:11.976
finding out that there is some relationship between two operations,
the operation of rescaling an image and the operation of
00:01:11.976 --> 00:01:22.390
filtering an image with the Gaussian filter. And this is illustrated
on the slide. So, if we want to apply both
00:01:22.390 --> 00:01:32.366
operations on an image, it doesn't matter in which order
we do it. So we could either first take the
00:01:32.366 --> 00:01:40.940
original input image, filter it with the Gaussian filter with
a certain value for the scale parameter Sigma.
00:01:40.939 --> 00:01:51.815
Then we get a blurred image, as you can see here at the
left bottom of the slide, and afterwards we rescale it by a
00:01:51.815 --> 00:02:01.677
factor c, where c is below one if we make the image smaller,
and c would be larger than one if we would make the image
00:02:01.677 --> 00:02:09.194
larger. So in this case, let us assume we make the image
smaller. So we end up with the image at the right bottom of the
00:02:09.194 --> 00:02:15.870
slide. So instead of first filtering the image with the Gaussian
filter and then rescaling it, we could also go the other way
00:02:15.870 --> 00:02:23.598
round. First, we scale it by the factor c. Then we get
a small version of the original image, which you see in the
00:02:23.598 --> 00:02:31.155
right upper corner of the slide, and afterwards we apply the
Gaussian filter. But now we must be careful. We must adapt the
00:02:31.155 --> 00:02:40.388
scale parameter, so now
we use a Gaussian filter with a scale parameter of c times
00:02:40.388 --> 00:02:48.676
Sigma. And then we end up with actually the same result
as if we would first filter the image with the
00:02:48.676 --> 00:02:57.852
Gaussian filter and then rescale it. The same principle
also applies for a Laplacian of Gaussian filter. So for a
00:02:57.852 --> 00:03:07.337
Laplacian of Gaussian filter, we can also
exchange rescaling and filtering. The only difference is that
00:03:07.337 --> 00:03:17.624
when we do the calculations, we find out that, yeah,
we can exchange things, but we have to consider, of course, the
00:03:17.624 --> 00:03:27.075
different scale parameter in the filter. So we need to replace
the Sigma by c times Sigma, and there is an additional
00:03:27.075 --> 00:03:37.251
difference. The equality only holds up to a
multiplicative factor of c squared here. So when you
00:03:37.251 --> 00:03:46.660
execute the proof, when you want to prove this property, you
will find that this factor of c squared occurs. That is
00:03:46.660 --> 00:03:55.552
a difference to the Gaussian filter here, as we can see.
However, still the message is, we can exchange both
00:03:55.552 --> 00:04:05.054
operations, rescaling and filtering, with the Laplacian of
Gaussian. Obviously, the slide has a mistake. So here at the
00:04:05.054 --> 00:04:17.153
right side, it shouldn't be LoG Sigma.
Otherwise, the equality doesn't make sense. Okay, that
00:04:17.153 --> 00:04:28.367
was a long slide. So now, what does that mean for us? Well, we
started to say, okay, we want to compare two images, and may
00:04:28.367 --> 00:04:37.791
be the object in which we are interested occurs in a
different size in the two images, like in this case, when we
00:04:37.791 --> 00:04:46.461
approach a vehicle from behind, and in the first image, we are
far away from the vehicle, so the vehicle appears only
00:04:46.461 --> 00:04:55.132
small, and in the second case, we have approached the vehicle,
and now the vehicle can be seen very large, and we
00:04:55.132 --> 00:05:05.515
want to compare them. We have seen that if we use the same value
for Sigma, then we run into trouble, because then the
00:05:05.515 --> 00:05:15.905
maxima of the filtered image occur at different positions
within the object. However, now we can get rid of
00:05:15.905 --> 00:05:26.120
that by using different values of Sigma. So here we have a
kind of scaling in the object appearance by a certain factor.
00:05:26.129 --> 00:05:35.436
And we know that we can compensate this effect by
changing the scale parameter Sigma in the filter. That means
00:05:35.436 --> 00:05:45.720
we would actually refer
to the same point on the object when we would consider the
00:05:45.720 --> 00:05:55.487
maximum that we get from the filtered image with a
small value for Sigma on the left-hand side, and with a
00:05:55.487 --> 00:06:03.747
considerably large value for Sigma on the right-hand side. Then
we get the same maximum points relative to the object of
00:06:03.747 --> 00:06:12.884
interest. So that is the basic idea. So we need to search for
suitable values for Sigma, such that we get rid of
00:06:12.884 --> 00:06:22.159
this scale problem, so that we become invariant to the scale in
which we see an object. So now, when we look at the left image,
00:06:22.159 --> 00:06:31.291
what we see is that the maxima that
we get occur at the same position as in the second
00:06:31.291 --> 00:06:39.128
image. However, still the question is, how do we find the
suitable values for Sigma? So we know there are suitable
00:06:39.128 --> 00:06:49.128
values for Sigma, but we do not know yet how to find them.
Well, the idea is to go to the so-called scale
00:06:49.128 --> 00:06:56.742
space. The scale space is defined as a function that depends
on three parameters. The first two are the image coordinates u
00:06:56.742 --> 00:07:05.976
and v, as we know them from the gray value function. The third one
is the Sigma value, and the Sigma value actually is the
00:07:05.976 --> 00:07:16.195
value of the scale parameter Sigma. So the scale space
is defined in such a way that L of u, v, Sigma is equal
00:07:16.195 --> 00:07:27.574
to, well, we take the original gray value image G, filter
it with the LoG filter with the respective scale
00:07:27.574 --> 00:07:38.646
parameter Sigma, and evaluate the
filter response at the position u, v. So what we get is, so to
00:07:38.646 --> 00:07:49.045
say, the filter response of an LoG filter with
varying values of Sigma. Well, what we get is not a two
00:07:49.045 --> 00:07:58.643
dimensional image, but it is a three dimensional structure.
So to say, a sequence of images of filter responses with
00:07:58.643 --> 00:08:10.370
varying values of Sigma. And now what we are interested in
is to find the local maxima of this scale space function.
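The scale space just defined can be written down in a few lines of code. This is only an illustrative sketch, not the lecture's implementation: it assumes SciPy's `gaussian_laplace` as the LoG filter and multiplies by Sigma squared as a scale normalization, in the spirit of the c squared factor discussed earlier; all function and variable names are made up for the example.

```python
import numpy as np
from scipy import ndimage

def scale_space(gray, sigmas):
    # One layer per sigma: L[k] = sigma_k^2 * (LoG_{sigma_k} * G),
    # evaluated at every pixel position (u, v).
    gray = gray.astype(float)
    return np.stack([s**2 * ndimage.gaussian_laplace(gray, s) for s in sigmas])

# SIFT-style sigma values: powers of the cubic root of two
sigmas = [2.0 ** (k / 3.0) for k in range(6)]
gray = np.random.rand(64, 64)   # stand-in for a gray value image
L = scale_space(gray, sigmas)
print(L.shape)                  # (6, 64, 64): a 3-D structure, one image per sigma
```

The result is exactly the three-dimensional structure described next: a stack of filter responses indexed by Sigma.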
00:08:10.379 --> 00:08:20.143
Here, we are not finding only the local maxima with respect to
the image coordinates u and v for a fixed value of Sigma,
00:08:20.143 --> 00:08:28.934
but also varying the values of Sigma and finding the maxima
over varying values of Sigma. The advantage is that
00:08:28.934 --> 00:08:37.088
this prevents interest points from moving depending on the
scale, because we only consider the interest points
00:08:37.088 --> 00:08:46.597
as salient points on that scale at which a variation of the Sigma value
00:08:46.597 --> 00:08:56.182
decreases the output of this scale space function.
Now, in such a way, we get to a certain degree invariance with
00:08:56.182 --> 00:09:05.713
respect to scaling. Or in other words, for each interest
point, we can determine the optimal scale parameter Sigma
00:09:05.713 --> 00:09:16.458
which fits to this interest point. So typically,
in practice, of course, this function L, again, is a
00:09:16.458 --> 00:09:26.431
function that takes three real-valued variables. But similar
to u and v, which we can only evaluate typically at
00:09:26.431 --> 00:09:36.243
integer positions, because they are given as a matrix, we
also typically evaluate the function only for certain discrete
00:09:36.243 --> 00:09:46.361
values of Sigma. So not for all possible values, for all
possible real values of Sigma, but only for certain well chosen
00:09:46.361 --> 00:09:57.064
values. And one choice that is used in SIFT, for instance,
is to use values for Sigma which are one, the cubic
00:09:57.064 --> 00:10:05.820
root of two, the cubic root of two squared, the cubic root of
two to the power three, et cetera. So we see
00:10:05.820 --> 00:10:14.446
always that the cubic root of two occurs to the power of something,
and the power only takes on integer values. Now, these
00:10:14.446 --> 00:10:23.836
are the typical scale parameters that are used in SIFT. That
means we can visualize the scale space function as
00:10:23.836 --> 00:10:33.055
such a kind of three dimensional structure. Each layer of the
structure can be interpreted as an image. It is so to say
00:10:33.055 --> 00:10:41.748
the filter response of the gray value input image, filtered
with an LoG with a fixed value for Sigma,
00:10:41.748 --> 00:10:51.309
one of those here, and the Sigma value varies from
layer to layer. So here on the lowest layer.
00:10:51.320 --> 00:10:59.758
We have the smallest Sigma value, equal to one. On the next
layer, we have the second smallest Sigma value, the cubic
00:10:59.758 --> 00:11:13.052
root of two, and so on. And now we say, okay, a
point, say the red point, is a local maximum in scale space
00:11:13.052 --> 00:11:26.362
if the L function, the scale space function L,
at this red pixel is larger than or equal to all
00:11:26.362 --> 00:11:40.595
the values of the function at all the other,
what is it, twenty-six neighboring pixels. Now,
00:11:40.595 --> 00:11:50.415
those which are neighboring in space, yeah, those which we
get by varying the u, v coordinates by plus minus one, and
00:11:50.415 --> 00:12:01.040
those for which we vary the scale parameter Sigma by plus minus
one. Now that means, if this value here of the red
00:12:01.040 --> 00:12:10.377
pixel, or the red position, is, whatever, twenty, and
we find a neighbor here in the same layer with a larger value,
00:12:10.377 --> 00:12:21.198
then the red pixel is not a maximum. If the value
is twenty, and on the next layer here we find a pixel in
00:12:21.198 --> 00:12:31.442
the local vicinity with a value of twenty-two, then the red
pixel as well is not a maximum of interest for us. Only if
00:12:31.442 --> 00:12:38.845
all the other twenty-six pixels which are shown here have
smaller values, or at most equal values, do we argue that the
00:12:38.845 --> 00:12:48.184
red pixel is a local maximum, and that we consider it as an
interest point. So here is an example for a one-dimensional
00:12:48.184 --> 00:12:59.805
function. So that is the gray value function in blue,
shown here. And now I have filtered this
00:12:59.805 --> 00:13:09.988
gray value function with an LoG with varying values of
Sigma, considering also this c,
00:13:09.988 --> 00:13:20.170
compensating this c squared dependency. And then the
filter responses look like what is shown here
00:13:20.170 --> 00:13:29.952
below. So what do we see? Each colored line refers to one
filter response for a certain value of Sigma. Yeah, the
00:13:29.952 --> 00:13:39.529
scale is given by these numbers. So these numbers refer to the
scale index. So for instance, here, scale number twelve, that is
00:13:39.529 --> 00:13:48.048
the red line here. It refers to a Sigma value of the cubic root
of two to the power twelve, and scale thirteen then refers
00:13:48.048 --> 00:13:57.946
to the cubic root of two to the power thirteen, and so on.
And now we have to consider which point here is a maximum
00:13:57.946 --> 00:14:09.992
in scale space. Well, let us have a look
where we find maxima. Obviously, here is a point. There is
00:14:09.992 --> 00:14:17.536
no other point that is larger than that. This seems to be a
maximum. And of course, that is true. If we go to the left
00:14:17.536 --> 00:14:25.158
or to the right, there is no larger point. And if we change
the scale, so this is scale number twenty-five, if we go to
00:14:25.158 --> 00:14:33.699
the neighboring scale, twenty-six, we end up here. So that is
definitely smaller. And if we look at the other neighboring
00:14:33.699 --> 00:14:43.707
scale, twenty-four, it is this one here, we see that it is also
smaller. That means this point here at the top is
00:14:43.707 --> 00:14:51.242
definitely a maximum in scale space. However,
let us consider other points here. Let us consider, for
00:14:51.242 --> 00:14:59.903
instance, this point here. Well, on scale number, what
is it, twenty-seven, on the violet curve. Obviously,
00:14:59.903 --> 00:15:10.425
this point is a maximum if we change the coordinate a little
bit. If we go a little bit to the left, or a little bit to the
00:15:10.425 --> 00:15:18.134
right, the filter response is decreasing. However, if we
go to neighboring scales, things are different. So if we
00:15:18.134 --> 00:15:27.058
change from scale number twenty-seven to scale number
twenty-eight, this is shown here, okay, then the scale space
00:15:27.058 --> 00:15:35.651
function decreases. That is still okay. However, if we
go from scale twenty-seven to scale twenty-six. Scale twenty-
00:15:35.651 --> 00:15:47.026
six is this one, then we see the scale space function is
increasing. That means this point is not a maximum in scale
00:15:47.026 --> 00:15:55.986
space, not a local maximum in scale space, because we can
change the scale and easily get points which are larger.
00:15:55.986 --> 00:16:07.557
Okay, other points. Well, if we consider, for instance,
scale number thirteen, that is this one here, then we
00:16:07.557 --> 00:16:14.796
find something that looks like a maximum here. If we go to the
left or to the right. The function is decreasing. That is
00:16:14.796 --> 00:16:21.954
good. However, if we change the scale again, we see, okay,
going to scale number twelve, well, the function is
00:16:21.954 --> 00:16:29.333
decreasing. That is good. However, going to scale number
fourteen, we see that the function is increasing. And that means
00:16:29.333 --> 00:16:38.068
this point here on scale number thirteen is not a local
maximum in scale space. It is a local maximum if we only
00:16:38.068 --> 00:16:46.047
fix the scale and only consider u and v, the spatial
coordinates, but that is not what we are interested
00:16:46.047 --> 00:16:55.640
in. It must also be a maximum if we vary the
scale. That means, in this case, only one point
00:16:55.640 --> 00:17:05.088
is a local maximum with
respect to both the spatial coordinates u, v and the scale
00:17:05.088 --> 00:17:14.493
parameter Sigma, while the other things that seem to
look like local maxima, they aren't, because they are
00:17:14.493 --> 00:17:24.662
not. We can vary the Sigma parameter and achieve
values of the scale space function
00:17:24.662 --> 00:17:33.449
which are larger. And therefore
these points are not local maxima, and not of
00:17:33.449 --> 00:17:44.820
interest for us. Okay, so we are searching for the local maxima
in scale space. And yeah, these are the points that we get.
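The twenty-six-neighbor test described above can be sketched in code. Again an illustrative sketch with made-up names, assuming the scale space is stored as a 3-D array indexed by (scale, u, v):

```python
import numpy as np

def is_scale_space_maximum(L, k, u, v):
    # A point is kept only if its value is >= all 26 neighbors:
    # the 3x3x3 block around (k, u, v) in scale and image coordinates.
    patch = L[k-1:k+2, u-1:u+2, v-1:v+2]
    return L[k, u, v] >= patch.max()

# toy scale space with a single clear peak
L = np.zeros((3, 5, 5))
L[1, 2, 2] = 1.0
print(is_scale_space_maximum(L, 1, 2, 2))   # True: no neighbor is larger
print(is_scale_space_maximum(L, 1, 2, 1))   # False: the peak next to it is larger
```

This mirrors the one-dimensional example in the lecture: a point must beat its neighbors both spatially and across the neighboring scales.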
00:17:44.829 --> 00:17:56.514
And now, for the basic understanding of how
SIFT detects points, how SIFT finds interest points, this
00:17:56.514 --> 00:18:05.740
understanding that was shown on the last slide is
sufficient. Yeah, knowing you filter the image with
00:18:05.740 --> 00:18:14.962
different LoGs, with varying Sigma values, and then
you search for local maxima. This is nice. This is
00:18:14.962 --> 00:18:22.990
sufficient for understanding the SIFT method. But
it is not very efficient. Why isn't it efficient? Well, it
00:18:22.990 --> 00:18:33.269
isn't efficient because if we filter an image with an LoG
with a very large value of Sigma, then we get a
00:18:33.269 --> 00:18:48.741
strong blurring. And then, well, of course, all the
details are lost. That is clear, and actually we spend a
00:18:48.741 --> 00:19:00.760
lot of time in calculating a very blurry image with
not that much information contained any
00:19:00.760 --> 00:19:09.700
more in the image, due to this blurring effect.
Though, there would be one way to get more computationally
00:19:09.700 --> 00:19:18.726
efficient, namely by first rescaling the image to a smaller size
and afterwards doing the filtering. We have seen that both
00:19:18.726 --> 00:19:26.776
processes, namely this rescaling and the filtering, can be
exchanged. Therefore, if we want to use very large values for
00:19:26.776 --> 00:19:35.180
Sigma and filter the image with very large values of Sigma,
it is much more computationally efficient to first scale the
00:19:35.180 --> 00:19:44.730
image to a smaller size and afterwards apply the filtering.
And this is implemented using something that is
00:19:44.730 --> 00:19:54.209
known as an image pyramid in computer vision. So how does
it work? Well, when we look at the Sigma values in
00:19:54.209 --> 00:20:02.514
which we are interested, we see the first one is one,
the second one is the cubic root of two, the cubic root of two squared,
00:20:02.514 --> 00:20:10.801
and the cubic root of two to the power three, which of course is
equal to two. And then we continue with values between
00:20:10.801 --> 00:20:20.256
two and four, and then between four and eight, and then between
eight and sixteen, and so on. If we are here at
00:20:20.256 --> 00:20:29.598
that point and say, okay, Sigma is equal to two, then,
well, what could we do? Well, instead of
00:20:29.598 --> 00:20:38.102
taking the original image and filtering it with an LoG with
Sigma equal to two, we could also say, okay, we first
00:20:38.102 --> 00:20:50.962
scale the image to half size, and afterwards
apply an LoG filter on the image with a Sigma
00:20:50.962 --> 00:21:02.490
value of one. Now, the effect is actually the same.
We have seen that we can get actually the same result
00:21:02.490 --> 00:21:09.709
by doing that as if we would first filter the image with
an LoG with a Sigma value of two, and afterwards rescale
00:21:09.709 --> 00:21:19.722
it. That is something that we have shown. Okay, and
that gives us the idea of what we do when we calculate this
00:21:19.722 --> 00:21:29.449
scale space. Well, for the first four steps, the
best thing is to take the original full-size picture and
00:21:29.449 --> 00:21:40.317
apply the LoG filter with Sigma values of one, the cubic root
of two, the cubic root of two squared, and two. This creates
00:21:40.317 --> 00:21:51.764
the first layer of this pyramid. Afterwards, we scale
the original image by a factor of two, so that we get an
00:21:51.764 --> 00:22:00.942
image that has half the width and half the height of the
original image. And afterwards we want to create the next scale
00:22:00.942 --> 00:22:08.667
layers. But we don't calculate them on the original image
size, but on these half-sized images. And instead of using
00:22:08.667 --> 00:22:17.284
filters with the original Sigma values, which are
shown here, we again start with Sigma equal to one, a Sigma
00:22:17.284 --> 00:22:27.717
equal to the cubic root of two, a Sigma equal to the cubic root
of two squared, and a Sigma equal to two. This yields the
00:22:27.717 --> 00:22:39.729
second layer of the pyramid, and afterwards we continue.
Here we again scale the image, to a
00:22:39.729 --> 00:22:51.976
quarter size, and then we go on like that to create the third
layer of the pyramid. Then we again rescale the image to a
00:22:51.976 --> 00:23:02.115
size of an eighth of the original image, and continue like that.
By doing that, we see that we only use large images for
00:23:02.115 --> 00:23:10.526
the first steps, and for the next steps we use smaller images,
and we can also use smaller filter masks, so that the
00:23:10.526 --> 00:23:17.370
whole process is computationally more efficient.
So this is a trick just to make things more efficient.
00:23:17.380 --> 00:23:26.870
Computationally, it doesn't change the result very much,
but it makes the calculations much more efficient.
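The octave trick can be sketched as follows. Again only an illustrative sketch with made-up names, assuming SciPy's `gaussian_laplace`: instead of filtering the full image with ever larger Sigma values, the image is halved between octaves and the same four small Sigma values, one up to two, are reused.

```python
import numpy as np
from scipy import ndimage

def log_pyramid(gray, n_octaves=3, scales_per_octave=4):
    # Rescale first, filter afterwards: each octave reuses the small
    # sigmas 1 .. 2 (powers of the cubic root of two) on a halved image.
    pyramid, img = [], gray.astype(float)
    for _ in range(n_octaves):
        octave = [ndimage.gaussian_laplace(img, 2.0 ** (k / 3.0))
                  for k in range(scales_per_octave)]
        pyramid.append(np.stack(octave))
        img = img[::2, ::2]   # half width, half height for the next octave
    return pyramid

pyr = log_pyramid(np.random.rand(64, 64))
print([o.shape for o in pyr])   # [(4, 64, 64), (4, 32, 32), (4, 16, 16)]
```

Only the first octave touches the full-size image; all larger effective Sigma values are handled on the smaller images, which is where the saving comes from.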
00:23:26.880 --> 00:23:40.125
Let us apply that to an example image, say, the stack of
books here on a desk. And yeah, the result when we
00:23:40.125 --> 00:23:56.792
do that is shown here. So all the red circles indicate
interest points that have been found, local maxima in scale
00:23:56.792 --> 00:24:06.073
space. Of course, what is shown
here are only those maxima which are considerably different
00:24:06.073 --> 00:24:14.153
from zero, which are above a certain threshold, because, of
course, due to small noise in the image, there are also
00:24:14.153 --> 00:24:21.592
local maxima everywhere, here, for instance. But these
local maxima are so small that we say, okay, they are just
00:24:21.592 --> 00:24:30.377
affected by noise in the image. So only those
maxima which are considerably different from zero
00:24:30.377 --> 00:24:41.338
are shown here in this image. So each of the red circles
refers to one interest point, and the circle radius provides the scale
00:24:41.338 --> 00:24:51.009
parameter which applies for that point, the scale value
Sigma for which this was a maximum. We can see that, for
00:24:51.009 --> 00:24:59.580
instance, for small structures like here on the corner on the
edge of the book. There are only very small circles. It is
00:24:59.580 --> 00:25:07.230
a very thin structure. Therefore, the maxima occur on a very
small scale. While for large structures, say this letter
00:25:07.240 --> 00:25:17.180
C here in the title of the book, that is a considerably
large structure, and therefore the maxima also occur on a
00:25:17.180 --> 00:25:27.975
very large scale, to refer to this large structure in the
image. Okay, what we also see is that the maxima
00:25:27.975 --> 00:25:39.676
occur mainly at edges of objects, here on the book edges, or
in textured areas in the image, but not on homogeneous
00:25:39.676 --> 00:25:49.606
areas. Yeah, that is it. Okay, but
still these maxima are not yet completely sufficient
00:25:49.606 --> 00:25:59.343
for our calculations, because what we have seen is that
maxima occur on gray level corners and on gray level
00:25:59.343 --> 00:26:08.722
edges, and with gray level edges, there is a problem. Consider
this example. So twice the same situation, but a little
00:26:08.722 --> 00:26:18.740
bit rotated. It may be taken from a different perspective,
but still the same basic structure, a black corner on white
00:26:18.740 --> 00:26:28.583
background. If you consider such a local maximum, which occurs
at a corner of the black rectangle, then it
00:26:28.583 --> 00:26:37.267
is easy to determine the same position in the right image.
Now we can easily say this refers to that point that is
00:26:37.267 --> 00:26:44.445
equal. If we consider a maximum here at the edge somewhere,
then, well, where is the corresponding point here in this
00:26:44.445 --> 00:26:51.819
image? Well, we actually do not know exactly. Somewhere here,
maybe a little bit more to the left, maybe a little bit
00:26:51.819 --> 00:27:01.742
more to the right. We do not know exactly. So while those
points which are occurring at corner positions
00:27:01.742 --> 00:27:11.470
can be determined very accurately in another image, those
which occur at gray level edges cannot be found
00:27:11.470 --> 00:27:19.976
that accurately. And therefore, what we want
to do is eliminate those local maxima which
00:27:19.976 --> 00:27:29.723
occur at gray level corners. And only... sorry, we want to
eliminate those which occur at gray level edges, and only
00:27:29.723 --> 00:27:40.420
keep those which occur at gray level corners.
How do we do that? Well, the question is, how reliably can
00:27:40.420 --> 00:27:51.571
we find that position again. And the
solution for that problem is to analyze the structure of the
00:27:51.571 --> 00:28:03.075
scale space function in a local vicinity around the
maximum that we have found. The idea is, well, if the
00:28:03.075 --> 00:28:12.373
scale space function is rather flat around the
maximum, then small changes in the image might have strong
00:28:12.373 --> 00:28:22.649
effects on the exact position of the maximum. While if the
function is not flat around the maximum,
00:28:22.649 --> 00:28:33.904
so that we have a very clear maximum, then it is very reliable,
because small changes in the image won't have a strong
00:28:33.904 --> 00:28:42.744
effect on the position of the maximum. So for that purpose,
we approximate the scale space function in the local
00:28:42.744 --> 00:28:50.642
vicinity of the maximum that we have found with a local
quadratic approximation. So we calculate the Taylor polynomial,
00:28:50.642 --> 00:28:58.618
of degree two around this position. And this means
the scale space function around the maximum that we have
00:28:58.618 --> 00:29:06.796
found, now, if we go a little bit to the left, a little bit
to the right, a little bit up and down, if we vary the
00:29:06.796 --> 00:29:13.639
position by Delta u and Delta v, is approximately
equal to the scale space function value at the position
00:29:13.639 --> 00:29:21.205
plus, well, the first order derivative of the
function in horizontal direction, with respect to the u
00:29:21.205 --> 00:29:29.210
coordinate, times Delta u, plus the first order
derivative with respect to v, times Delta v, plus a half times, well.
00:29:29.220 --> 00:29:37.069
And this is actually the second order derivative: H is
the Hessian, the matrix of second order derivatives,
00:29:37.069 --> 00:29:44.889
multiplied from the left and right with the vector Delta.
So we know that we are considering a maximum, so that u,
00:29:44.889 --> 00:29:53.069
v, Sigma is a maximum. That means we can conclude that the
first order derivatives are equal to zero, because we are at a
00:29:53.069 --> 00:30:01.659
maximum. And that means what we get is actually the
function at the position u, v, Sigma, plus a half times,
00:30:01.659 --> 00:30:12.676
well, the Hessian multiplied with Delta u, Delta v from
the left and from the right. And now we can
00:30:12.676 --> 00:30:22.253
analyze the local structure of this approximation
to the gray
00:30:22.253 --> 00:30:33.068
value function. Actually, as we can see
it here, the local structure depends only on H, on this
00:30:33.068 --> 00:30:43.044
Hessian. And it turns out, when we analyze it a
00:30:43.044 --> 00:30:51.995
little bit more carefully, that H has a
certain structure. First of all, it is a second order
00:30:51.995 --> 00:31:00.527
derivative, so it is a symmetric matrix. And in this case, we
can also conclude that it is a positive semi-definite
00:31:00.527 --> 00:31:15.207
matrix. Well, is that true? No, it must be a negative
semi-definite matrix. And so it has a certain structure,
00:31:15.207 --> 00:31:23.563
and this means it also has some eigenvalues. There will be two
eigenvalues, potentially different eigenvalues. And
00:31:23.563 --> 00:31:34.692
if both of them are very different from zero, then the
function is not flat at this position, then we have a very
00:31:34.692 --> 00:31:43.041
distinctive maximum. While if one of the two eigenvalues,
or if both of the eigenvalues, are close to zero, then this
00:31:43.041 --> 00:31:50.072
means the function is flat in certain directions. That
means then at least in certain directions, we can vary the
00:31:50.072 --> 00:32:01.424
position, and the function will not change much along them. And that
means the maximum is not that well defined. It is not that
00:32:01.424 --> 00:32:13.408
distinct. And therefore we would like to eliminate this
position from the list of interest points. So there is one
00:32:13.408 --> 00:32:21.049
criterion to do that, to calculate whether the two eigenvalues
are close to zero or not. This is shown here: if this
00:32:21.049 --> 00:32:29.483
applies, yeah, where r is a certain threshold that we have
to define in order to decide whether the eigenvalues are
00:32:29.483 --> 00:32:37.549
different from zero or not, then we can decide whether to
accept a
00:32:37.549 --> 00:32:44.953
certain local maximum as point of interest or not. Actually,
the thing that we do here is very much similar to what we
00:32:44.953 --> 00:32:51.395
know in computer vision as the Harris-Stephens corner
detector. So for those of you who know that already, maybe
00:32:51.395 --> 00:32:58.340
from the lecture on machine vision: the thing that we do
here is actually very, very similar to this Harris-Stephens
00:32:58.340 --> 00:33:05.329
corner detector. While for the others who don't know it yet, the
Harris corner detector would be a reference to get an
00:33:05.329 --> 00:33:12.602
understanding of why it is like that. But I don't want to go
into all the details, but just show you this result. So
00:33:12.602 --> 00:33:20.506
there is a possibility to find out whether we are on an
edge or at a corner, and if we are at a corner, we
00:33:20.506 --> 00:33:29.393
accept a point as an interest point. And if we are close to an
edge, we reject it. Let us have a look at the result. So that
00:33:29.393 --> 00:33:39.327
is actually the image that we have just seen with all the
local maxima, which are above the threshold. And now let us
00:33:39.327 --> 00:33:46.985
filter those according to that criterion. And here you find
the two results. One for a threshold value of one point
00:33:46.985 --> 00:33:56.838
five, one for a threshold value of two. And what we can see
is that those maxima which are here on the
00:33:56.838 --> 00:34:05.054
edge of the book. For instance, they all have disappeared here
as well as here, while all those points which are next to
00:34:05.054 --> 00:34:12.639
corners, like here or here, or in textured areas, they survived.
They are accepted, while all those at edges are rejected.
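The criterion on the two eigenvalues can be sketched without computing the eigenvalues explicitly, using the trace and determinant of the two-by-two Hessian, as in the Harris-Stephens detector mentioned above. The exact form of the test here, trace squared over determinant compared against (r+1) squared over r, is the one known from SIFT's edge test and is an assumption; the slide's criterion may be written differently, and all names are illustrative.

```python
import numpy as np

def accept_corner(H, r=2.0):
    # Keep a maximum only if the two eigenvalues of the 2x2 Hessian H are
    # comparable in magnitude (corner-like); reject edge-like points where
    # one eigenvalue is much larger than the other.
    tr, det = np.trace(H), np.linalg.det(H)
    if det <= 0:                  # eigenvalues of mixed sign: reject
        return False
    return tr**2 / det < (r + 1)**2 / r

# eigenvalues -5 and -4 are similar: corner-like, accepted
print(accept_corner(np.diag([-5.0, -4.0])))    # True
# eigenvalues -10 and -0.1 are very different: edge-like, rejected
print(accept_corner(np.diag([-10.0, -0.1])))   # False
```

The trace/determinant form avoids an explicit eigenvalue decomposition, since trace and determinant are the sum and product of the two eigenvalues.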
00:34:12.639 --> 00:34:21.564
We can also see it clearly here at this chair. So here at the
corner of the chair, the point is accepted, while at the
00:34:21.564 --> 00:34:32.291
edge of the chair, all the points are rejected.
So this yields the basic idea of how SIFT detects
00:34:32.291 --> 00:34:41.629
interest points. So to summarize that... oh no, not yet
summarized. What do we do? We filter the image with an LoG
00:34:41.629 --> 00:34:49.758
filter with varying values of Sigma, then we search in the
scale space for local maxima. Afterwards, we throw away all
00:34:49.758 --> 00:34:58.500
local maxima which are close to zero, because they are
very likely to be generated just by noise. And
00:34:58.500 --> 00:35:05.704
afterwards we check for each local maximum whether it is close
to a corner, or whether it is close to
00:35:05.704 --> 00:35:12.309
an edge, and we throw away those which are only
close to edges and not at corners, and then the remaining
00:35:12.309 --> 00:35:19.691
points are all interest points, which we further
consider for all the subsequent calculations. And by doing that,
00:35:19.691 --> 00:35:27.154
we get a set of points, typically for one image, we get
several hundred points, which are interest points, which have
00:35:27.154 --> 00:35:36.382
the potential to be seen again and recognized again in another
image from the same scene. So the next step in the feature
00:35:36.382 --> 00:35:45.825
point methods is to calculate a descriptor for each point.
A descriptor means a vector of numbers, with entries
00:35:45.825 --> 00:35:54.595
that describe the point and its local vicinity, that describe
the appearance of the local vicinity of the point, so
00:35:54.595 --> 00:36:04.242
that afterwards, when we have two images, and we have the
descriptor of all interest points in both images, then we can
00:36:04.242 --> 00:36:12.732
just compare the descriptors and find points which have the
same or very similar descriptors. And then we can conclude
00:36:12.732 --> 00:36:21.157
that those points are likely to refer to the same point in
the three-dimensional world. Now, that is the idea. For that
00:36:21.157 --> 00:36:30.273
purpose, we need descriptors. Of course, descriptors
should also be invariant with respect to rotation, scaling,
00:36:30.273 --> 00:36:40.451
aspect variation, and illumination, as the method to find
interest points also has been invariant with respect to these
00:36:40.451 --> 00:36:48.628
properties. Okay, so let us start with the basic idea and
something that we already started with when we talked about
00:36:48.628 --> 00:36:56.935
block matching in binocular vision.
What did we do? Well, we were taking a
00:36:56.935 --> 00:37:05.345
local patch of the image around a certain point of
interest for which we wanted to calculate the disparity, and
00:37:05.345 --> 00:37:12.990
then we were searching, in another image, for
points which share the same environment,
00:37:12.990 --> 00:37:21.399
where the environment looks very similar to the image
patch which we have taken from the first image.
00:37:21.409 --> 00:37:30.178
And that means taking a certain patch around a point of interest,
and then writing down all the grey values into a large
00:37:30.178 --> 00:37:37.718
vector, establishes a first version of a possible descriptor;
that would be one descriptor that we could use. Of course,
00:37:37.718 --> 00:37:45.074
this descriptor has some disadvantages. It is not invariant
with respect to scale. It is not invariant with respect to
00:37:45.074 --> 00:37:53.496
illumination changes. It is not invariant with respect to rotation.
But it at least could be a choice for such a descriptor
00:37:53.496 --> 00:38:01.721
in some cases where all these properties are not relevant.
Now that means we take a small patch around a point of
00:38:01.721 --> 00:38:10.428
interest, say this upper corner of this lorry. And
then we just write all the gray values of this patch into
00:38:10.428 --> 00:38:18.538
a large vector, and that would be a simple, very simple
version of a descriptor. And when we have another image
00:38:18.538 --> 00:38:26.936
of more or less the same scene, we also
search for all interest points. We might also find
00:38:26.936 --> 00:38:35.400
the left upper corner of this lorry. We again
take a small patch around this interest point,
00:38:35.400 --> 00:38:43.939
write all these gray values into a large vector, and then we
can compare vectors with each other and check which vectors
00:38:43.939 --> 00:38:51.886
from the first image and from the second image are most
similar. And then we would argue: okay, there
00:38:51.886 --> 00:38:59.625
seems to be a pair of corresponding points, as we did in
binocular reconstruction for calculating the disparity
00:38:59.625 --> 00:39:08.292
of points. Okay, so that is just here to give you an
idea of what such a descriptor could be. But of
00:39:08.292 --> 00:39:14.891
course such a descriptor, as shown here, is
not invariant with respect to illumination. It is not invariant
00:39:14.891 --> 00:39:21.200
with respect to rotation or with respect to scaling,
so it is not suitable for our purpose. Okay.
00:39:21.210 --> 00:39:32.645
So we need something more clever. Okay, so
what could we do? One thing that we already
00:39:32.645 --> 00:39:40.280
found is that taking gray values as such is not invariant with respect
to illumination. We should always consider something like
00:39:40.280 --> 00:39:48.270
differences of gray values or gradient information. Taking
the gradient also means we are comparing gray values:
00:39:48.280 --> 00:39:56.481
we calculate differences of gray values. Okay, so
let us consider gradient information instead of gray value
00:39:56.481 --> 00:40:05.924
information. With such an idea, we already get rid of
variations in illumination. So the second step is
00:40:05.924 --> 00:40:25.104
to say: we want to compress the information
a little bit, down to the really relevant information.
00:40:25.104 --> 00:40:37.957
From other research in computer vision it is well known
that gray level information as such does not
00:40:37.957 --> 00:40:45.943
contain that much information for the purpose of
recognizing objects or recognizing things. But gradient
00:40:45.943 --> 00:40:55.578
information contains a lot of valuable
information. So let us consider the gradients and the gradient
00:40:55.578 --> 00:41:04.896
orientations instead of just the gray values,
and let us compress this information into a
00:41:04.896 --> 00:41:13.180
more compact form. So how does that work? That is a little
bit more complicated. What do we do? In the center here of
00:41:13.180 --> 00:41:21.376
this image, we assume there is the interest point for which
we want to calculate the descriptor. Now, what do we do? We
00:41:21.376 --> 00:41:28.911
take a rectangular area around this point. And for that
rectangular area, we calculate all the gray value gradients. So
00:41:28.911 --> 00:41:38.170
for each pixel, we get a gray level gradient. And now we want
to compress the information of gray value gradients into a
00:41:38.170 --> 00:41:46.857
more compact form, and we are doing that by
calculating orientation histograms. What is an orientation
00:41:46.857 --> 00:41:55.624
histogram? For all these pixels here, for which we have
calculated the gradient, we first calculate the orientation
00:41:55.624 --> 00:42:05.388
of the gradient. And then we are doing a binning. That means
we are not interested in the exact angle
00:42:05.388 --> 00:42:13.062
of the gradient, but only whether it is between zero degrees
and forty-five degrees, or whether it is between forty-five
00:42:13.062 --> 00:42:20.519
and ninety degrees, or between ninety and one hundred and
thirty-five degrees, and so on. So we categorize the angle of
00:42:20.519 --> 00:42:29.449
the gradient into, say, eight categories, depending on
the main orientation. Now we collect all the pixels in
00:42:29.449 --> 00:42:39.949
the local vicinity of the interest point which have a
gradient that points into a direction between
00:42:39.949 --> 00:42:51.139
zero degrees and forty-five degrees. And for all these, we sum
up the lengths of the gradients. Well, this yields a certain
00:42:51.139 --> 00:43:04.798
number, and this number tells us how dominant this
gradient direction is in the local vicinity of the point of
00:43:04.798 --> 00:43:13.534
interest. Of course, we are not doing it just
with this interval of angles between zero and forty-five
00:43:13.534 --> 00:43:21.234
degrees, but we are also considering the second bin: all the
pixels which have a gradient that points into the
00:43:21.234 --> 00:43:28.848
direction between forty-five and ninety degrees. Again, we
collect all those pixels, and we add up the lengths of the
00:43:28.848 --> 00:43:36.248
gradients of all those pixels, which yields another number
that tells us how dominant this gradient
00:43:36.248 --> 00:43:45.858
direction is in the vicinity of the interest point. Like that,
we do it for all possible bins. That means what we get in
00:43:45.858 --> 00:43:55.547
this orientation histogram are eight numbers, and those eight
numbers tell us how often or how strongly gray level edges
00:43:55.547 --> 00:44:06.160
in those eight main directions occur in the vicinity of the
interest point.
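The binning just described can be sketched in Python/NumPy as follows (an illustrative sketch, not the actual SIFT implementation; the exact bin mapping and the use of `np.gradient` are my own choices): the gradient angle at each pixel selects one of eight forty-five-degree bins, and the gradient lengths are summed up per bin.

```python
import numpy as np

def orientation_histogram(patch, n_bins=8):
    """Orientation histogram of a gray-value patch: bin the gradient
    angle into n_bins categories (45-degree bins for n_bins=8) and,
    per bin, sum up the gradient lengths (magnitudes)."""
    gy, gx = np.gradient(patch.astype(np.float64))
    angle = np.arctan2(gy, gx)        # gradient orientation in [-pi, pi]
    magnitude = np.hypot(gx, gy)      # gradient length
    # map each angle to a bin index 0..n_bins-1
    bins = ((angle + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())
    return hist

# A patch with a purely horizontal gray-value ramp: all gradients
# point in the same direction, so all mass falls into a single bin.
ramp = np.tile(np.arange(8, dtype=np.float64), (8, 1))
h = orientation_histogram(ramp)
```

For the ramp patch, only one of the eight histogram entries is nonzero, which is exactly the "one dominant gradient direction" case described above.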
00:44:06.170 --> 00:44:15.515
And afterwards we could argue: okay, we take this vector with these
eight numbers and use it as a descriptor for that point.
00:44:15.515 --> 00:44:24.010
That would be one possibility: still a very simple
descriptor, but it tells us something about the local
00:44:24.010 --> 00:44:31.327
structure around the point. And of course, with such a thing,
we could, for instance, already determine or distinguish
00:44:31.327 --> 00:44:39.911
very sharp corners and corners which are not that sharp.
If we have a ninety-degree corner, this
00:44:39.911 --> 00:44:48.121
descriptor would look different than if we had
a corner that has an angle between the two
00:44:48.121 --> 00:44:57.613
edges of, say, forty-five degrees or one hundred thirty-five
degrees. So the descriptor would already be
00:44:57.613 --> 00:45:06.619
different for different kinds of corners. But still, in one
of these orientation histograms, the information is
00:45:06.619 --> 00:45:17.642
compressed too much. Therefore, we extend this approach a
little bit by subdividing the rectangle around the point of
00:45:17.642 --> 00:45:27.367
interest into several sub-areas. In the simplest form, which
is shown here on the slide, we would subdivide this
00:45:27.367 --> 00:45:38.689
area into four sub-areas: the left upper area, the right
upper area, the left lower area, and the right lower area. And for
00:45:38.689 --> 00:45:46.393
each of those areas, we calculate separate orientation histograms.
That means we would get an orientation histogram with
00:45:46.393 --> 00:45:55.449
eight values for the upper left area, one orientation histogram
for the upper right area, one orientation histogram
00:45:55.449 --> 00:46:06.134
for the lower left, and one for the lower right area. Those
are illustrated on the slide at this position,
00:46:06.134 --> 00:46:17.048
with the star-shaped diagrams. They should indicate the
orientation histograms for the eight main directions. They show, as
00:46:17.048 --> 00:46:27.907
their lengths, which should indicate, or better to
say illustrate, the respective sum of gradient lengths of all the
00:46:27.907 --> 00:46:33.880
pixels that belong to the
respective orientation bin.
00:46:34.170 --> 00:46:45.569
Now we have four times eight entries in the orientation
histograms. We concatenate them, and what we get is a thirty-two-
00:46:45.569 --> 00:46:54.138
dimensional vector. And this thirty-two-dimensional vector
can again be used as a descriptor for that
00:46:54.138 --> 00:47:03.976
point of interest in the centre. In reality, if we
consider SIFT as it is implemented, it even subdivides the
00:47:03.976 --> 00:47:13.360
area around the pixel of interest not only into four, but
into sixteen different sub-areas. So it subdivides each of
00:47:13.360 --> 00:47:23.182
these areas, which are shown here, again into four, so
that we get sixteen in total. And for each of these
00:47:23.182 --> 00:47:29.763
sixteen sub-areas it calculates an orientation histogram with
eight entries, and then it concatenates all these orientation
00:47:29.763 --> 00:47:37.834
histograms, so that we get sixteen times eight, that means
in total one hundred and twenty-eight entries, in the
00:47:37.834 --> 00:47:50.172
descriptor vector that is used by SIFT. So again, we
subdivide the area around the pixel of interest into
00:47:50.172 --> 00:47:57.973
sixteen areas; for each, we calculate an orientation histogram,
then we concatenate those, and we get the whole descriptor that
00:47:57.973 --> 00:48:07.917
is calculated by SIFT. I have left out some details, but
this is just to give you the basic idea. So, one question: assume that
00:48:07.917 --> 00:48:20.241
the point is located here in the center, at these positions.
Of course, you are right: if it were perfectly aligned
00:48:20.241 --> 00:48:31.007
with the image, then it could be either here or here or here
or here, somewhere. But in practice, the
00:48:31.007 --> 00:48:39.787
thing is rotated a little bit, so that things do not fit
exactly to the pixels of these areas. But you are right: in this
00:48:39.787 --> 00:48:48.028
case, we have some rounding to the closest integer, something
like that. Okay.
00:48:48.028 --> 00:48:56.760
So that yields a descriptor that is quite powerful, and it has been shown to be invariant
with respect to illumination, because it is only based on
00:48:56.760 --> 00:49:04.341
gradient information. In practice, it has been shown to be very
characteristic for a point, so it is unlikely that different
00:49:04.341 --> 00:49:11.374
points create the same descriptor. And for all those
who have attended the machine vision lecture: of course,
00:49:11.374 --> 00:49:19.235
this idea with the orientation histogram is very similar to
what you might know as HOG features, histogram of oriented
00:49:19.235 --> 00:49:28.301
gradients features. The same idea: SIFT is older than the HOG
paper, so SIFT was earlier, and later on this idea was given
00:49:28.301 --> 00:49:36.800
the name histogram of oriented gradients features. Okay, what
we aren't yet is scale invariant, and we
00:49:36.800 --> 00:49:46.381
aren't yet rotation invariant. The scale invariance comes
in by defining the size of the environment that we are using.
00:49:46.381 --> 00:49:57.353
From the detection step, we know for each interest point
a Sigma value: the scale on which
00:49:57.353 --> 00:50:06.255
the maximum occurred. And we have already seen that the scale
is related to the size of the structures which
00:50:06.255 --> 00:50:15.071
occur in the image: for large structures, the scale on which
we find the maximum is large; for small structures, the scale on
00:50:15.071 --> 00:50:25.007
which we find the maximum is small. And therefore, to become scale
invariant, we say that this area
00:50:25.007 --> 00:50:33.700
around the point of interest, which we consider to calculate
the descriptor, is chosen
00:50:33.700 --> 00:50:42.293
according to the Sigma value, proportional to the Sigma value.
If Sigma is small, then this area around the interest
00:50:42.293 --> 00:50:51.179
point is chosen to be small, from which the descriptor is
calculated, while if Sigma is very large, then we also choose
00:50:51.179 --> 00:51:00.489
a very large size of this area from which we calculate the
descriptor. And by doing that, we again become invariant
00:51:00.500 --> 00:51:12.036
with respect to scale, because we only consider the scale
on which the maximum was found.
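This proportional choice of the descriptor window can be sketched in Python/NumPy (an illustrative sketch, not from the lecture; the proportionality factor `k` is an arbitrary choice of mine, not a value given in the lecture):

```python
import numpy as np

def descriptor_window(image, row, col, sigma, k=6.0):
    """Cut out the descriptor patch around (row, col) with a size
    proportional to the detection scale Sigma (factor k is an
    illustrative choice), clipped to the image borders."""
    half = max(1, int(round(k * sigma / 2)))
    r0, r1 = max(0, row - half), min(image.shape[0], row + half + 1)
    c0, c1 = max(0, col - half), min(image.shape[1], col + half + 1)
    return image[r0:r1, c0:c1]

# A point found at a small scale gets a small patch,
# a point found at a large scale gets a large patch.
img = np.zeros((200, 200))
small = descriptor_window(img, 100, 100, sigma=1.0)
large = descriptor_window(img, 100, 100, sigma=4.0)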
00:51:12.036 --> 00:51:19.840
That is the way we become scale invariant. The last question is: how do
we become rotation invariant? Of course, this procedure, as we
00:51:19.840 --> 00:51:28.023
have introduced it so far, is not yet rotation invariant.
If we rotate the image, all the gradients are rotated as
00:51:28.023 --> 00:51:36.974
well. And that means the orientation
histogram changes. Okay, so this doesn't work yet. So
00:51:36.974 --> 00:51:47.193
what can we do to become rotation invariant? Well, the
idea is to first calculate a kind of preferential
00:51:47.193 --> 00:51:55.812
direction for each interest point: we take the gray value structure
around the interest point and calculate a preferential
00:51:55.812 --> 00:52:05.176
direction. And then we virtually rotate the image back in
such a way that the preferential direction always, say, points
00:52:05.176 --> 00:52:13.882
to the right. And then we calculate the descriptor based on
this rotated image. How does it work? How do we find
00:52:13.882 --> 00:52:21.462
this preferential direction? We again calculate an orientation
histogram, as we have already seen. So this might be the
00:52:21.462 --> 00:52:29.053
gray level patch from which we want to
calculate the descriptor. We calculate an orientation
00:52:29.053 --> 00:52:38.195
histogram, now an orientation histogram with more bins.
The bins are not forty-five degrees large, but
00:52:38.195 --> 00:52:47.157
ten degrees large, so that we get in total thirty-six
bins. The orientation histogram now is plotted here
00:52:47.157 --> 00:52:53.858
a little bit differently, visualized differently
than on the slides before: on the horizontal
00:52:53.858 --> 00:53:02.290
axis, we have the angle, the gradient angle; on the vertical
axis, we have the histogram entry. And now we select the
00:53:02.290 --> 00:53:09.799
maximum of this orientation histogram, and this yields our
preferential direction. So in this case, it might happen
00:53:09.799 --> 00:53:18.826
that the maximum is found here, and so the angle that refers
to this bin is chosen as the preferential direction.
00:53:18.826 --> 00:53:28.355
Based on this preferential direction, we rotate the image back, and
afterwards we use this calculation step to calculate the
00:53:28.355 --> 00:53:34.742
descriptor. So the implementation is a little bit different
from what I have explained, but the basic idea is what
00:53:34.742 --> 00:53:41.197
I have explained: we always rotate the image in such a way
that this preferential direction points to
00:53:41.197 --> 00:53:50.143
the right after rotation. By doing that, we, say,
normalize the local vicinity of the point of interest with
00:53:50.143 --> 00:54:00.012
respect to rotation, and we become
rotation invariant. Okay. So, putting everything together: the
00:54:00.012 --> 00:54:09.980
first step was that we wanted to find well-defined points of interest.
We were calculating the maxima in scale space, the local
00:54:09.980 --> 00:54:17.721
maxima. Afterwards, we were throwing away all the local
maxima which are close to zero, because they are highly
00:54:17.721 --> 00:54:26.476
affected by noise; we want to get rid of them. And
afterwards we have thrown away all the maxima which are
00:54:26.476 --> 00:54:34.888
close to edges, and only kept those
which are accepted by the corner
00:54:34.888 --> 00:54:44.604
criterion. So that is the first step. And now, for all
those points which have survived this filtering,
00:54:44.604 --> 00:54:54.772
we first calculate the preferential direction by calculating
a gradient orientation histogram, to eliminate
00:54:54.772 --> 00:55:03.334
the rotation of the image. And then we calculate
the orientation histograms in these sub-segments, and
00:55:03.334 --> 00:55:10.894
afterwards we concatenate all these features that we have calculated,
all these entries of the orientation histograms, into a
00:55:10.894 --> 00:55:21.605
large vector. And this vector then is the descriptor that is
used by SIFT. And this descriptor is a kind
00:55:21.605 --> 00:55:30.553
of very characteristic description of that point, and it can
be used to find the same point again in other images. So
00:55:30.553 --> 00:55:39.909
let us have a look at the result. So again, this is one
small area inside of the image that we have considered as
00:55:39.909 --> 00:55:51.583
example throughout the lecture. So these
are all the SIFT features which are very
00:55:51.583 --> 00:56:01.986
salient, the most dominant local maxima in scale
space, just to show you that. So again, the center
00:56:01.986 --> 00:56:09.799
of each circle refers to the interest point. The radius
refers to the scale: the larger the radius is, the larger the
00:56:09.799 --> 00:56:17.445
Sigma value was on which the point was found. Again, we see
that large structures like this create large Sigma
00:56:17.445 --> 00:56:26.189
values, and small structures, as we can see here, create small
Sigma values. And then what we can also see is this line
00:56:26.189 --> 00:56:35.058
here. This line is the preferential direction that has been found,
the dominant direction, which is used to, say, virtually
00:56:35.058 --> 00:56:45.261
rotate the image back and calculate the descriptor. For some of
the points, you see that there are two lines; in these cases,
00:56:45.261 --> 00:56:53.568
there are two dominant maxima in the orientation histogram.
So it is not perfectly clear which was the real dominant
00:56:53.568 --> 00:57:03.768
direction. In all those cases, SIFT is just creating two
different descriptors for the same point. This occurs, for
00:57:03.768 --> 00:57:10.404
instance, here. Here are two different dominant directions,
and therefore SIFT creates two descriptors for that
00:57:10.404 --> 00:57:19.443
point as well. For instance, here is another case where we
have two descriptors for the same point.
00:57:19.443 --> 00:57:27.306
If you are interested in getting more feature points, of
course we can adapt the thresholds with which points are
00:57:27.306 --> 00:57:36.430
filtered. And then, for instance, we get those points as
additional feature points as well, clearly seen at a
00:57:36.430 --> 00:57:45.648
corner. Again, in highly textured areas we find points, but
of course in homogeneous areas here there are no points. If we
00:57:45.648 --> 00:57:54.853
are interested in even more, we lower the thresholds,
so we do not filter out that many maxima, and we get those
00:57:54.853 --> 00:58:04.543
points as well, or even those points.
How many points we want to find depends on our
00:58:04.543 --> 00:58:11.342
application. If we are only interested in a few points, then
of course it is already sufficient to take the ten
00:58:11.342 --> 00:58:20.338
or twenty or thirty most dominant points. But often we want
to create a lot of points, because we do not know in
00:58:20.338 --> 00:58:29.178
advance which points we will be able to see again in another
image. So typically we get several hundred points from one
00:58:29.178 --> 00:58:37.404
image. So now we can start to compare points
from different images. Here are two different images
00:58:37.404 --> 00:58:46.888
from the same scene; only the perspective from which we have
taken the pictures has changed a little bit. Now let
00:58:46.888 --> 00:58:56.565
us calculate the feature points for both images, compare the
descriptors, and draw a line between those points
00:58:56.565 --> 00:59:07.068
which have the same or almost the same descriptors. This
can be seen here.
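This comparison step can be sketched in Python/NumPy (an illustrative sketch, not from the lecture; the Euclidean distance and the acceptance threshold `max_dist` are my own choices): for each descriptor in the first image, find the most similar descriptor in the second image and keep the pair if the distance is small enough.

```python
import numpy as np

def match_descriptors(desc1, desc2, max_dist=0.5):
    """Nearest-neighbor matching: for each descriptor in desc1, find
    the most similar descriptor in desc2 (smallest Euclidean
    distance) and keep the index pair (i, j) if the distance is
    below max_dist (an illustrative threshold)."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < max_dist:
            matches.append((i, j))
    return matches

# Toy example with 3-entry "descriptors": the first two descriptors
# of set a have near duplicates in set b, the third has none.
a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
b = np.array([[0.0, 1.01, 0.0], [1.02, 0.0, 0.0]])
pairs = match_descriptors(a, b)
```

The third descriptor of `a` is left unmatched, which corresponds to an interest point that is simply not visible, or not found, in the other image.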
00:59:07.068 --> 00:59:17.446
We see the feature points here and here as the green dots, and
the green lines in between. We see that
00:59:17.446 --> 00:59:28.374
most of those green lines are somehow parallel,
so you might argue, or I might argue, that this is a
00:59:28.374 --> 00:59:38.389
binocular camera system, and therefore we can also conclude
that these are epipolar lines. Sorry, what am I saying? No.
00:59:38.400 --> 00:59:46.045
Sorry, that was nonsense, forget it. Okay, so these are more or
less parallel. And this indicates that the correspondences
00:59:46.045 --> 00:59:55.195
are likely to be correct. So if you check some points
in detail, for instance this point and this point, they
00:59:55.195 --> 01:00:03.799
really seem to be the same point, so this correspondence is
correct. For these points as well, it seems that these two
01:00:03.799 --> 01:00:11.997
points are correct. But obviously we also have some
mistakes. Here, for instance, if we look carefully: both
01:00:11.997 --> 01:00:20.834
are in this textured area. But if we look carefully,
we will find that this one is below the letter N,
01:00:20.834 --> 01:00:30.624
while this one is below the letter A.
So it is a point that looks similar, but it is not
01:00:30.624 --> 01:00:42.689
exactly the same point. So a certain number of false pairs
of points occurs in practice, but most of them are found
01:00:42.689 --> 01:00:51.455
correctly. And then one strange point, this one here is
connected to this one here. And obviously it is not the same
01:00:51.455 --> 01:00:59.672
point. But if you look at the appearance of the point, we
must admit that the appearance really looks very
01:00:59.672 --> 01:01:07.484
much the same, although it is not the same point. And therefore
these two points have been assigned to
01:01:07.484 --> 01:01:22.940
each other, although they are not the same. But most of them,
of course, are correct. So this is SIFT. Oh,
01:01:22.940 --> 01:01:32.894
sorry. There should be one more slide somewhere. Okay, then I
have removed it from the set of slides. Okay, so for which
01:01:32.894 --> 01:01:40.836
purpose can we use SIFT? Well, we can use SIFT for binocular
reconstruction, if we are interested in finding the
01:01:40.836 --> 01:01:48.137
correspondences for the most dominant salient points.
We can use it for optical flow calculation, which will be
01:01:48.137 --> 01:01:57.088
discussed in the next chapter, to find in which way points
are moving from one image to the next image. We can use
01:01:57.088 --> 01:02:05.533
it to detect objects again. Say we have a first
image, and we have a completely different
01:02:05.533 --> 01:02:15.257
image, but one object occurs in both images. We might be
able to say: okay, we take all the
01:02:15.257 --> 01:02:24.604
feature points on this object from the first image and check
which of these feature points can be found in the other image.
01:02:24.604 --> 01:02:33.951
So we can use it for object detection, for finding the same
object again, and all these things. And besides these
01:02:33.951 --> 01:02:42.004
methods, we can also use the feature points as landmarks
for self-localization in robotics, something that
01:02:42.004 --> 01:02:48.724
we will discuss in chapter six. And for all these reasons,
because there are very many applications for feature point
01:02:48.724 --> 01:02:56.245
methods, they have become quite popular. SIFT was the first
feature point method that really became popular, that was
01:02:56.245 --> 01:03:03.374
very successful, that had all these different components
combined with each other. But
01:03:03.374 --> 01:03:10.753
meanwhile, other methods have been developed which have
some other properties, some
01:03:10.753 --> 01:03:18.423
better properties than SIFT. And here is a collection, not a
full collection of feature point methods, but some feature
01:03:18.423 --> 01:03:26.198
point methods, which you find in computer vision, just to
show you that there is really a large variety of different
01:03:26.198 --> 01:03:34.455
methods that have been developed so far. So let us
start: the first method you see is SIFT, the scale-invariant
01:03:34.455 --> 01:03:42.600
feature transform, first published in nineteen
ninety-nine. So it is twenty years old. It implements a
01:03:42.600 --> 01:03:51.732
detector and a descriptor, so we have a method to find salient
points and a method to calculate the descriptor, and it is
01:03:51.732 --> 01:03:59.347
invariant with respect to scale, brightness, and rotation. It is not
real-time capable. So if you want to compute all the
01:03:59.347 --> 01:04:07.678
features from one image, it requires some time, maybe half
a second or so for a normal image, which we would
01:04:07.678 --> 01:04:16.929
consider not a real-time-capable method.
Then the next one is SURF, speeded-up robust
01:04:16.929 --> 01:04:25.698
features. Here, the main focus was on making the method
more computationally efficient. It works a little
01:04:25.698 --> 01:04:34.888
bit differently than SIFT in detail, but it follows similar
ideas. It also provides a detector and a descriptor,
01:04:34.888 --> 01:04:43.991
and the computational time for SURF is better
than for SIFT; for small images it is real time, for larger
01:04:43.991 --> 01:04:52.134
images it is not real time. Then CenSurE, center surround
extremas. It is only a detector, so it doesn't come with
01:04:52.134 --> 01:05:00.203
a descriptor. However, you can combine this detector step
with a descriptor from SIFT or with a descriptor from
01:05:00.203 --> 01:05:09.214
SURF, if you like; you have the choice to combine
things. It is also scale invariant, brightness invariant, a
01:05:09.214 --> 01:05:17.418
little bit rotation invariant, and real-time capable;
it is from two thousand and eight. Then FAST, features from
01:05:17.418 --> 01:05:27.299
accelerated segment test: also just a detector, not a
descriptor. Harris corners: this is actually the Harris-
01:05:27.299 --> 01:05:37.852
Stephens corner detector, very old, from nineteen
eighty-eight already. It is not that much scale invariant,
01:05:37.852 --> 01:05:48.106
but brightness and rotation invariant. A very old
method, which can also be used as a detector for feature
01:05:48.106 --> 01:05:55.870
point methods, which can also be used for that purpose.
Then another method, good features to track, which is
01:05:55.870 --> 01:06:04.787
also quite popular, a method from nineteen
ninety-four, which can be used as a detector. Then ORB,
01:06:04.787 --> 01:06:13.017
oriented FAST and rotated BRIEF, which combines two things,
namely the FAST detector and a descriptor method which is
01:06:13.017 --> 01:06:21.247
called BRIEF, which is seen here. It combines detector and
descriptor, obviously, from two thousand and eleven. And BRISK,
01:06:21.247 --> 01:06:30.541
a method from two thousand and eleven. BRIEF is just a
descriptor method, not a detector method, so you combine it
01:06:30.541 --> 01:06:40.331
with one of the other methods which provide a detector; it is from
two thousand and ten. And, of course, a block of gray values that is
01:06:40.331 --> 01:06:48.931
used for block matching could also be seen as a descriptor
method, but of course not with really nice properties;
01:06:48.931 --> 01:06:57.960
it is therefore only used in some approaches where we do not need
scale, brightness, and rotation invariance, for instance
01:06:57.960 --> 01:07:08.633
in binocular vision. Okay, so that summarizes this chapter on feature point methods.
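As a small illustration: the core idea of the FAST segment test mentioned above can be sketched in a few lines of Python. This is a simplified sketch with assumed parameters (a radius-3 circle of 16 pixels, a threshold t, and n contiguous pixels); the real detector additionally uses a quick rejection test and non-maximum suppression.

```python
# Simplified sketch of the FAST segment test (assumed parameters:
# radius-3 circle of 16 pixels, threshold t, n contiguous pixels).
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def is_fast_corner(img, u, v, t=20, n=12):
    """True if at least n contiguous circle pixels are all brighter than
    center + t or all darker than center - t."""
    c = img[v][u]
    # classify each circle pixel: +1 brighter, -1 darker, 0 similar
    labels = [1 if img[v + dy][u + dx] > c + t
              else (-1 if img[v + dy][u + dx] < c - t else 0)
              for dx, dy in CIRCLE]
    for sign in (1, -1):                # look for a contiguous run of n
        run = 0
        for lab in labels + labels:     # doubled list handles wrap-around
            run = run + 1 if lab == sign else 0
            if run >= n:
                return True
    return False
```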
01:07:08.633 --> 01:07:19.242
Just to summarize, what did we discuss? Well, we discussed SIFT, we discussed some basic ideas that are implemented by SIFT. I
01:07:19.242 --> 01:07:28.965
didn't provide you all the details about SIFT, because SIFT actually is very complicated if you go into the details, and
01:07:28.965 --> 01:07:38.343
it is very much tuned. So you might ask: why do we use the LoG function and not another function that is also
01:07:38.343 --> 01:07:47.991
invariant to illumination? Why do we use this or that method? The answer is: because it works, because it
01:07:47.991 --> 01:07:56.867
has been tested on many different images, and it has been shown to work. No one can explain why exactly this combination
01:07:56.867 --> 01:08:05.113
works so well, but there is some empirical evidence that this combination works. And if you change
01:08:05.113 --> 01:08:13.605
something, it might not work that well anymore. And as I said, there are even more details in implementing the descriptor
01:08:13.605 --> 01:08:23.002
and the detector step that I didn't go into, in order not to confuse you too much. If you are really interested in
01:08:23.002 --> 01:08:35.926
what SIFT is, then you have to read the original paper, or even look at the implementation of SIFT; then you get all the
01:08:35.926 --> 01:08:44.520
details. However, what is important to remember is this idea of a scale space, of finding
01:08:44.520 --> 01:08:53.977
maxima in a scale space, how the scale space is built, why a scale space makes sense, that there is some equivalence
01:08:53.977 --> 01:09:02.664
between rescaling an image and changing the parameter Sigma. It is important to know how the preferential direction can be
01:09:02.664 --> 01:09:13.530
calculated, or how a descriptor in general is built. Then, based on that, we have seen at least a list of
01:09:13.530 --> 01:09:22.130
some other feature point methods, which can be used as alternatives to SIFT, and most of them are more efficient,
01:09:22.130 --> 01:09:32.585
computationally more efficient, and are therefore used very much in modern implementations. Okay, that is the
01:09:32.585 --> 01:09:44.412
chapter on feature point methods. We still have twenty minutes, and there is one question yet. Well, as you like;
01:09:44.412 --> 01:09:52.239
you are free to choose a distance metric as you like. You can use something like a Euclidean distance between
01:09:52.239 --> 01:10:00.693
the feature vectors, or you can use a Manhattan distance, or you can use a cosine metric, as you like. Test it for
01:10:00.693 --> 01:10:14.422
your application and see which works best. Okay, then let us
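For illustration, the three distances just mentioned could look like this in Python, sketched for plain lists as descriptor vectors:

```python
import math

# Euclidean, Manhattan, and cosine distance between two descriptor
# vectors, here just plain Python lists of numbers.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 0 for parallel descriptors, 1 for orthogonal ones
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

For example, euclidean([0, 0], [3, 4]) gives 5.0 while manhattan([0, 0], [3, 4]) gives 7; which one matches descriptors best depends on the application.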
go to the next chapter. So the next chapter is on optical
01:10:14.422 --> 01:10:22.526
flow and image-based tracking. So again, we want to compare images and find out in which way the content of the images
01:10:22.526 --> 01:10:30.098
varies from one image to the other image. In binocular vision, we said: okay, we have two images, which are taken at
01:10:30.098 --> 01:10:38.201
the same point in time, which show the same scene. And we
want to compare where is the point in the left image compared
01:10:38.201 --> 01:10:47.380
to the right image, what is the disparity? In the feature point methods we said: we get some images. We don't
01:10:47.380 --> 01:10:54.521
know when they have been taken, where they have been taken.
But we want just to compare whether we find corresponding
01:10:54.521 --> 01:11:01.739
points in them, whether there are some feature points which we find again in the other image. In optical flow calculation,
01:11:01.750 --> 01:11:10.678
We argue, okay, we want to analyze a sequence of images. So
we assume that we get images from the same camera, which are
01:11:10.678 --> 01:11:20.022
taken over time at different points in time. And we want to
know in which way the image content has changed, or in other
01:11:20.022 --> 01:11:30.132
words, in which way the objects that we see have moved from one image to the next one. The reference list is
01:11:30.132 --> 01:11:39.080
seen here. So of course, the standard textbooks on computer vision should all more or less have a chapter on
01:11:39.080 --> 01:11:48.247
optical flow calculation. The last two papers are original papers. So the
01:11:48.247 --> 01:11:57.251
Horn and Schunck paper is a paper about a so-called variational method for optical flow calculation, which was
01:11:57.251 --> 01:12:05.491
the first one following this variational approach. We touch on it within this lecture without going into details. If you
01:12:05.491 --> 01:12:13.667
are interested, just consider this paper to get some basic idea. And the last paper is about the second part of this
01:12:13.667 --> 01:12:22.153
chapter, where we are not talking about optical flow calculation, but where we ask: how can we find the same
01:12:22.153 --> 01:12:30.638
object in a sequence of images without going down to a pixel level? It is one of the original papers, so if you are interested in
01:12:30.638 --> 01:12:38.521
details of the method that we discuss at the end of this
chapter, you might consider this paper. So what is optical
01:12:38.521 --> 01:12:46.568
flow? So far, as I said, we were analyzing either individual images or individual gray level functions, or pairs of gray
01:12:46.568 --> 01:12:54.385
level functions in binocular reconstruction. Now we assume a video sequence, that means a sequence of images over time. So
01:12:54.385 --> 01:13:04.115
the function that we consider now is a gray level function that depends on the pixel position u and v, but also on an
01:13:04.115 --> 01:13:11.796
additional variable t, which indicates the time at which the image was taken. And we assume that the time is just
01:13:11.796 --> 01:13:19.890
counted with integers. So the first image, second image, third image: t is equal to zero, one, two, three, four,
01:13:19.890 --> 01:13:27.142
et cetera. And we want to examine the changes in the images over time. These changes might be caused either
01:13:27.142 --> 01:13:34.693
by the ego motion of the camera, if we walk around with the camera or drive around with the camera. Of course,
01:13:34.693 --> 01:13:41.789
whether the environment moves with respect to the camera or the camera moves with respect to the environment doesn't matter.
01:13:41.800 --> 01:13:48.433
This causes changes in the image, or of course, there might
also be changes in the environment, independent of the
01:13:48.433 --> 01:13:54.855
movement of the camera. They also cause changes. That is also
what we are interested in. So the relative change between
01:13:54.855 --> 01:14:03.286
the environment and the camera, that is what we are interested in. And we want to observe
01:14:03.286 --> 01:14:13.005
that on the level of pixels, on a pixel level. So at this point, we do not talk about the velocity of objects
01:14:13.005 --> 01:14:20.901
in the three dimensional world. But we are only talking
about, where did an object occur in the image, in pixel
01:14:20.901 --> 01:14:28.403
coordinates in the last image, and where does it occur in
the subsequent image? now we are talking about movements in
01:14:28.403 --> 01:14:36.093
the image, not yet about movements in the scene. Let us start with the definition: the optical flow is the
01:14:36.093 --> 01:14:43.549
apparent shifting of any point in an image caused by a
relative movement between the camera and the observed object.
01:14:43.560 --> 01:14:54.401
Well, what is important to say: it is a shift in the image, not in the three-dimensional world, but in the image. So
01:14:54.401 --> 01:15:02.395
here are two images, and I just show them one after the other, back and forth. And we clearly see that some structures
01:15:02.395 --> 01:15:10.743
change in the image, in this case due to the ego motion of our vehicle. And of course, this vehicle here in front is
01:15:10.743 --> 01:15:19.591
also moving, so it also causes some optical flow. Now, this shifting of points that we can see here from one image to
01:15:19.591 --> 01:15:28.306
the other one, that is called the optical flow. And we expect that we can calculate the optical flow for
01:15:28.306 --> 01:15:35.633
all pixels, or for almost all pixels. Therefore, we distinguish two things in the literature: a dense flow and a
01:15:35.633 --> 01:15:43.120
sparse flow. Dense flow means we want to determine this movement for each pixel in the image, or almost each pixel in
01:15:43.120 --> 01:15:52.093
the image. And sparse flow means we are not interested in all pixels, but only for some pixels in the image do we
01:15:52.093 --> 01:15:59.796
determine the movement. So, for instance, only for salient points, only for feature points. Then we might
01:15:59.796 --> 01:16:07.021
get, for such an image, maybe one hundred, two hundred, three hundred pixels for which we calculate the optical flow. In the
01:16:07.021 --> 01:16:15.951
dense flow case, we want to get optical flow for at least eighty or ninety percent of all pixels. So which flow do we
01:16:15.951 --> 01:16:25.132
expect now, especially since the lecture is called Automotive Vision? That means: which flow do we expect when we are driving a
01:16:25.132 --> 01:16:35.017
car? Let us assume we have a camera mounted behind the windscreen of the car, looking to the front. Which optical flow
01:16:35.017 --> 01:16:44.083
typically occurs, which optical flow can we expect? So the first thing: let us say we are driving
01:16:44.083 --> 01:16:54.287
on a planar structure, completely flat, and we are just driving forward. So which optical flow do
01:16:54.287 --> 01:17:04.595
we expect? Well, the result can be seen here. Each of these lines here refers to one
01:17:04.595 --> 01:17:13.265
optical flow vector. The red point is actually the pixel for which the optical flow is calculated. And this
01:17:13.265 --> 01:17:22.408
line can be interpreted as an arrow that tells us from which point to which other point the point moves. This
01:17:22.408 --> 01:17:33.497
point moves from here to here, from the red end to the non-red end of the line. So this is very much as
01:17:33.497 --> 01:17:42.085
you know it, maybe, from computer games, from some starship simulations: you are approaching, and then the
01:17:42.085 --> 01:17:54.030
points somehow move like that once you go straight forward. This is this kind of starship
01:17:54.030 --> 01:18:03.561
simulation style. Then pitching: that means you are not moving with the car, that means you
01:18:03.561 --> 01:18:11.388
are not moving in the forward direction, you are just standing, but then someone is pushing your car up and down, so
01:18:11.388 --> 01:18:21.690
to say, so that it is somehow pitching. And then the optical flow field looks like that. Well, a little bit
01:18:21.690 --> 01:18:34.682
different when we have rolling. So if the car rolls, it goes, let me think, in this case: on
01:18:34.682 --> 01:18:43.931
the left side of the car it is pushed down, on the right side of the car it is pushed up. Then we get such a
01:18:43.931 --> 01:18:53.857
structure. And this, for instance, is driving in a curve, so combining a forward movement with a rotation around the vertical
01:18:53.857 --> 01:19:03.544
axis of the car. It is an extreme case here, but this is the optical flow field which we get when we make a
01:19:03.544 --> 01:19:13.103
very sharp turn, when we drive in a curve with very high curvature. So these are typical images that occur when we calculate
01:19:13.103 --> 01:19:26.992
optical flow in these scenes. So, what can we say, what can we see? Very typical things: areas
01:19:26.992 --> 01:19:37.194
which are nearby create larger optical flow vectors, and areas which are far away, actually at the horizon, for such a
01:19:37.194 --> 01:19:47.045
movement of straight driving, at the horizon we observe an optical flow of zero: points are not moving at all.
01:19:47.045 --> 01:19:58.984
But points which are very, very nearby create very strong optical flow. Okay, so now: these images that
01:19:58.984 --> 01:20:10.400
we have just shown were calculated by simulating a world and simulating a movement of the camera.
01:20:10.420 --> 01:20:18.584
And then I was calculating which optical flow we expect. So this is not what we normally want to do. Normally, we
01:20:18.584 --> 01:20:25.193
have two images from a sequence of images from a camera, and we want to calculate the optical flow just based on these images;
01:20:25.193 --> 01:20:31.940
we do not know how the camera has moved, we do not know how the objects in the scene have moved. We only have these
01:20:31.940 --> 01:20:40.141
images, and we want to calculate the optical flow. What
can we do? so we make two assumptions. One assumption is
01:20:40.141 --> 01:20:46.210
constant illumination. And another assumption is that the
target point, the point for which we want to calculate the
01:20:46.210 --> 01:20:54.382
optical flow, is not hidden in one of these images. Of course, if it is hidden, if it is occluded in one of those images, we
01:20:54.382 --> 01:21:04.230
cannot calculate the optical flow, because we cannot see it. Well, let us assume that the point is visible in both
01:21:04.230 --> 01:21:16.580
images, and that it shares the same illumination. Based on that, we can do some calculations. We say:
01:21:16.580 --> 01:21:28.269
the point for which we want to calculate the optical flow has the same gray value in both images. That means
01:21:28.269 --> 01:21:37.221
the point is located at position (u, v) in the first image, and in the second image it is located at
01:21:37.221 --> 01:21:45.230
position (u plus delta u, v plus delta v), and delta t should be, in our case, one: we are considering the next image.
01:21:45.239 --> 01:21:55.011
So this is a property which must hold, based on the first assumption of constant illumination: the point
01:21:55.011 --> 01:22:04.241
of interest has the same gray value in both images. And (delta u, delta v) is the optical flow, the vector that
01:22:04.241 --> 01:22:12.706
describes the movement of the point from one image to the
other. This is a necessary condition, but it is not yet
01:22:12.706 --> 01:22:21.129
sufficient; there are still many points for which this equality holds. However, let us start with this equation.
01:22:21.140 --> 01:22:28.614
So the first row here is again the same equation, but now rearranged in such a way that on the left-hand
01:22:28.614 --> 01:22:35.947
side we have zero. So zero is equal to this difference of gray values in the first image and the second image. Now, what
01:22:35.947 --> 01:22:44.803
we do is a Taylor approximation of the first term, a first-order Taylor approximation, which is then equal to G of u,
01:22:44.803 --> 01:22:52.701
v, t, plus delta u times the first-order partial derivative of the gray value function with respect to u, plus delta v times the
01:22:52.701 --> 01:22:59.722
first-order partial derivative with respect to v, plus delta t times the first-order partial derivative
01:22:59.722 --> 01:23:11.795
with respect to t. Well, the minus G of u, v, t is preserved. Now we see that we have G of u, v, t here, and minus G of u, v,
01:23:11.795 --> 01:23:20.403
t here in the same formula, so this term yields zero. And furthermore, we assume that we are considering
01:23:20.403 --> 01:23:30.924
subsequent images. That means that this delta t value here is equal to one. And then we
01:23:30.924 --> 01:23:43.407
get this equation here. And this is, again, a necessary condition, which must be met when
01:23:43.407 --> 01:23:52.814
calculating the optical flow. So if (delta u, delta v) is a true optical flow vector, then this equation must hold. And this
01:23:52.814 --> 01:24:01.420
equation has a name: it is called the motion constraint equation.
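Written out, the brightness constancy assumption and its first-order Taylor expansion (with delta t equal to one) give:

```latex
% brightness constancy between subsequent images (\Delta t = 1):
G(u, v, t) = G(u + \Delta u,\; v + \Delta v,\; t + 1)

% first-order Taylor expansion of the right-hand side; the G(u, v, t)
% terms cancel, leaving the motion constraint equation:
\frac{\partial G}{\partial u}\,\Delta u
  + \frac{\partial G}{\partial v}\,\Delta v
  + \frac{\partial G}{\partial t} = 0
```

This is one linear equation in the two unknowns Delta u and Delta v, which is why it is a necessary but not a sufficient condition.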
Now, it is relating the motion of a point, delta u, delta
01:24:01.420 --> 01:24:10.586
v, the optical flow of the point, to the change of the gray value structure, which can be found here
01:24:10.586 --> 01:24:19.091
in the derivatives of G with respect to u and v, and to the temporal change of the gray value structure over time, the
01:24:19.091 --> 01:24:31.186
derivative of G with respect to t. So now, when we
have an image. How can we calculate this term here? the
01:24:31.186 --> 01:24:41.573
of the grave value function, respect to you. How do we do
it? Mhm, we have suggestions. How would you do it? cake led
01:24:41.573 --> 01:24:54.156
the first-order derivative with respect to u, given a gray value image? Which of the filters is
01:24:54.156 --> 01:25:04.823
suitable? Which one? Yes: Prewitt, or Sobel. Okay, so we already know how to do that. We can use a
01:25:04.823 --> 01:25:13.956
Sobel or Prewitt filter applied to the image. Then we get this term here, and of course also this term here. How
01:25:13.956 --> 01:25:22.465
can we calculate that one, the partial derivative of the gray value function with respect to t? t is a time index;
01:25:22.465 --> 01:25:31.824
that means it refers to the different images in the
sequence of images. How could we calculate that? Yeah, just
01:25:31.824 --> 01:25:41.363
subtract the images. That is a good idea: just subtract the images. We approximate this derivative by just
01:25:41.363 --> 01:25:51.389
calculating the difference of the gray values of subsequent images. That is a valid approximation of this
01:25:51.389 --> 01:25:59.765
derivative. A question? Okay, the question is whether we made the assumption that this term remains constant. The
01:25:59.765 --> 01:26:10.215
answer: we made the assumption that a point that we observe has the same brightness over time, but the point moves
01:26:10.215 --> 01:26:22.484
within the image. So we observe a car that is moving, and we assume that the brightness with which we perceive the
01:26:22.484 --> 01:26:30.807
car remains the same over time, but the car is moving. That
means changing its position. That means we see it at a
01:26:30.807 --> 01:26:39.819
different position. That is the assumption that we made. So the brightness with which we see a three-dimensional
01:26:39.819 --> 01:26:47.976
point remains the same, but the three-dimensional point might move with respect to the camera, and that means it
01:26:47.976 --> 01:26:55.023
might appear at a different position in the image. But here, for calculating this partial derivative, we keep the
01:26:55.023 --> 01:27:03.522
position the same. We consider the same position in the image, actually the same pixel position. And this
01:27:03.522 --> 01:27:12.449
might change because the scene might have moved. That is the important thing. And if we just calculate the
01:27:12.449 --> 01:27:21.439
difference of two grey value images. We keep the position of
the pixels where they are, and just calculate the change of
01:27:21.439 --> 01:27:31.197
brightness at each pixel position. Okay, so this is shown here: as you already said, we use Sobel or Prewitt filter masks
01:27:31.197 --> 01:27:40.068
for calculating the partial derivatives with respect to u and v, and calculate the difference of the gray values,
01:27:40.068 --> 01:27:49.432
pixel by pixel, to approximate the derivative with respect to time.
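A minimal sketch of these three derivative estimates, on an assumed synthetic image pair (a gray value ramp that shifts one pixel to the right between the frames) and with standard 3x3 Sobel masks:

```python
# Derivative estimates for the motion constraint equation, on an assumed
# synthetic image pair: a horizontal gray value ramp G = 10 * u that
# shifts one pixel to the right between the two frames.
SOBEL_U = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # d/du (horizontal)
SOBEL_V = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # d/dv (vertical)

def conv_at(img, u, v, kernel):
    """Apply a 3x3 kernel centered at pixel (u, v)."""
    acc = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            acc += kernel[dy + 1][dx + 1] * img[v + dy][u + dx]
    return acc / 8.0            # normalize so a ramp's slope comes out per pixel

img_t  = [[10 * u for u in range(5)] for v in range(5)]        # frame at t
img_t1 = [[10 * (u - 1) for u in range(5)] for v in range(5)]  # frame at t + 1

g_u = conv_at(img_t, 2, 2, SOBEL_U)   # spatial derivative wrt u -> 10.0
g_v = conv_at(img_t, 2, 2, SOBEL_V)   # spatial derivative wrt v -> 0.0
g_t = img_t1[2][2] - img_t[2][2]      # temporal derivative: frame difference -> -10
du  = -g_t / g_u                      # here g_v = 0, so the equation resolves to du = 1.0
```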
And maybe one last slide: this illustrates the motion
01:27:49.432 --> 01:27:59.732
constraint equation for the one-dimensional case of an image that consists of just one row, where we omit the v
01:27:59.732 --> 01:28:11.200
coordinate. So the sine curve is the original image at point in time t. The blue curve is the shifted gray value
01:28:11.200 --> 01:28:19.754
function, one time step later. u is the position of interest. Now, this shift here, the spatial shift,
01:28:19.754 --> 01:28:30.210
is delta u, which we are interested in. The partial derivative with respect to t is the difference of gray values, so it is the
01:28:30.210 --> 01:28:41.241
difference between this value and this value, as indicated by this arrow. And the derivative with respect to the spatial
01:28:41.241 --> 01:28:50.644
dimension actually describes the slope of this tangent. And if we shift this tangent a little bit, we see
01:28:50.644 --> 01:28:57.579
that there is a triangle. And from this triangle, we can
actually derive this relationship. And based on this
01:28:57.579 --> 01:29:04.791
relationship, if we resolve it, we get the motion constraint equation for this one-dimensional case. Okay, so that is it
01:29:04.791 --> 01:29:10.299
for today. Thank you, and let us see each other again in one week.
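As a small addendum to the last slide: the one-dimensional motion constraint equation can be checked numerically. Resolving it gives delta u equal to minus G_t over G_u; the sketch below assumes a smooth sine profile and a known sub-pixel shift.

```python
import math

# One-dimensional check of the motion constraint equation: a smooth
# profile g(x) is shifted by a known amount du_true between two frames;
# du is then recovered as -G_t / G_u (assumed sine profile, central
# difference for the spatial derivative).
def profile(x):
    return math.sin(0.25 * x)

du_true = 0.4                                     # true shift between the frames
g_t0 = [profile(x) for x in range(40)]            # "image" row at time t
g_t1 = [profile(x - du_true) for x in range(40)]  # shifted row at time t + 1

def estimate_shift(g0, g1, x):
    g_u = (g0[x + 1] - g0[x - 1]) / 2.0  # spatial derivative (central difference)
    g_t = g1[x] - g0[x]                  # temporal derivative (frame difference)
    return -g_t / g_u                    # motion constraint equation, resolved

est = estimate_shift(g_t0, g_t1, 10)     # close to 0.4
```

The recovered shift is only approximate because the first-order Taylor expansion ignores the curvature of the profile; the smaller the shift, the better the estimate.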