WEBVTT
00:05.000 --> 00:10.860
So, now we can go on with the lecture topic.
00:11.640 --> 00:15.200
So, now you can learn all the techniques that you need if you want to
00:15.200 --> 00:18.120
participate in these groups, so to say.
00:19.420 --> 00:25.160
Yeah, last time we started with a brief repetition of machine vision
00:25.160 --> 00:31.160
or with some basic concepts in machine vision, so that you have some
00:31.160 --> 00:33.540
basic knowledge that we need for the lecture.
00:34.040 --> 00:40.380
And we discussed the convolution operator with which we can manipulate
00:40.380 --> 00:43.160
an image in a certain way.
00:43.600 --> 00:49.920
One filter mask that is very useful and used very often is a Gaussian
00:49.920 --> 00:53.040
filter that is also something that we already discussed in the
00:53.040 --> 00:56.580
lecture, with which we can blur images.
00:56.580 --> 01:02.820
And here you see an example of how an image is blurred using a
01:02.820 --> 01:03.480
Gaussian filter.
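As a small illustration of this blurring operation, here is a sketch in Python, assuming NumPy is available; the mask size and sigma are arbitrary example values, not the ones from the lecture slides:

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 2-D Gaussian filter mask (size/sigma: example values)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def convolve2d(image, kernel):
    """Plain 2-D convolution, 'valid' region only (no border padding)."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]          # convolution flips the mask
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * flipped)
    return out

# Blurring a single bright pixel spreads its gray value over the mask
img = np.zeros((9, 9))
img[4, 4] = 1.0
blurred = convolve2d(img, gaussian_kernel(5, 1.0))
```

Because the mask is normalized, the total gray value is preserved; the bright pixel is just spread out in a Gaussian-shaped blob.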
01:03.920 --> 01:09.300
Then we said, okay, blurring or smoothing the gray values of an image
01:09.300 --> 01:11.200
is one operation that we need.
01:11.560 --> 01:15.700
Sometimes we also need other operations which we can implement with a
01:15.700 --> 01:17.420
filter mask, with a convolution.
01:17.920 --> 01:23.080
And the second important thing that we need to do is calculating
01:23.080 --> 01:23.760
derivatives.
01:24.320 --> 01:29.780
So, a gray value image can be interpreted by us as a
01:29.780 --> 01:36.280
function that takes as input the coordinates of a pixel, the row and
01:36.280 --> 01:41.380
column of the pixel, and that yields as output the gray value.
01:41.700 --> 01:46.760
So, if we interpret that as a function, then we say, okay, we might
01:46.760 --> 01:50.120
also be interested in the first order derivative of this function.
01:50.960 --> 01:57.340
And we have seen that with two kinds of filter masks, we can
01:57.340 --> 02:03.280
approximate, at least approximate, the first order
02:03.280 --> 02:05.120
derivative of the gray value function.
02:05.500 --> 02:10.920
So, these are the two basic filter masks that are shown here to
02:10.920 --> 02:15.240
calculate the partial derivative in horizontal direction and in
02:15.240 --> 02:16.060
vertical direction.
02:16.380 --> 02:20.600
In practice, people prefer to use these filter masks, the Sobel filter
02:20.600 --> 02:23.160
masks, which are a little bit extended filter masks.
02:23.400 --> 02:29.460
They also calculate the partial derivative of the gray
02:29.460 --> 02:29.820
value.
02:30.020 --> 02:34.240
However, they are a little bit more robust with respect to noise in
02:34.240 --> 02:35.420
the image.
02:35.580 --> 02:40.420
That means they do a little bit of smoothing together with the
02:40.420 --> 02:41.620
calculation of the derivative.
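To make the Sobel masks concrete, here is a small sketch on a hypothetical toy image; note that the sign of the response depends on whether one uses correlation or convolution, since convolution flips the mask:

```python
import numpy as np

# Sobel masks: derivative in one direction, 1-2-1 smoothing in the other
sobel_u = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])  # partial derivative in horizontal direction
sobel_v = sobel_u.T               # partial derivative in vertical direction

# A toy image with a vertical gray level edge: dark left, bright right
img = np.zeros((5, 5))
img[:, 3:] = 1.0

# Response at the center pixel (cross-correlation with the mask;
# convolution would flip the mask and only change the sign here)
patch = img[1:4, 1:4]
response_u = np.sum(patch * sobel_u)  # strong: the edge is vertical
response_v = np.sum(patch * sobel_v)  # zero: no horizontal structure
```

This reproduces the behavior described above: the horizontal derivative highlights vertical gray level edges, while the vertical derivative gives no response on them.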
02:42.520 --> 02:44.800
So, here are the example images.
02:46.100 --> 02:50.620
Now, the result of the filtering operation is color-coded.
02:51.300 --> 02:57.700
So, zero output is green and large positive outputs are red and large
02:57.700 --> 02:59.100
negative outputs are blue.
02:59.240 --> 03:04.100
And we can see that with this calculation of the partial derivatives,
03:04.700 --> 03:09.120
those structures are highlighted where we have strong contrast between
03:09.120 --> 03:11.640
dark and bright areas.
03:11.940 --> 03:17.220
So, for instance, here at the boundaries of this marking in front
03:17.220 --> 03:23.580
of the vehicle, we have strong contrast, strong gray level edges and
03:23.580 --> 03:24.840
they are highlighted.
03:25.160 --> 03:29.200
While in the homogeneous areas, like in the sky or here on the ground,
03:29.760 --> 03:31.700
the output of the filter is zero.
03:32.440 --> 03:36.800
Now, this is the output for the partial derivative in the horizontal
03:36.800 --> 03:37.340
direction.
03:37.340 --> 03:42.120
On the left-hand side and on the right-hand side, it's the output of
03:42.120 --> 03:46.100
the filter mask that calculates the derivative with respect to the
03:46.100 --> 03:46.960
vertical direction.
03:47.160 --> 03:51.560
And what we can see is, if we calculate the horizontal partial
03:51.560 --> 03:57.640
derivative, vertical gray level edges are highlighted.
03:58.980 --> 04:04.300
While if we calculate the partial derivative in vertical direction,
04:04.620 --> 04:08.780
horizontal structures are highlighted, while completely vertical
04:08.780 --> 04:10.760
structures disappear.
04:12.280 --> 04:17.380
Okay, so if you talk about a first-order derivative, why not talk
04:17.380 --> 04:19.060
about a second-order derivative?
04:19.640 --> 04:23.540
Of course, this might also be relevant and interesting for us.
04:24.180 --> 04:28.520
If we say a second-order derivative of the gray value function, what
04:28.520 --> 04:29.200
is it?
04:29.200 --> 04:32.040
And maybe you remember that from your math class.
04:32.220 --> 04:36.720
If you have a function that takes two input values, two variables as
04:36.720 --> 04:41.460
input and a real number as output, then the second-order derivative is
04:41.460 --> 04:42.240
a Hessian.
04:42.620 --> 04:45.360
A Hessian is a matrix, a two-by-two matrix.
04:45.600 --> 04:49.100
It's symmetric, so it has three different entries.
04:50.160 --> 04:54.680
Namely, there is the second-order partial derivative with respect to
04:54.680 --> 05:00.380
the horizontal direction twice, then the mixed second-order derivative
05:00.380 --> 05:06.200
with respect to the horizontal, and once with respect to the vertical
05:06.200 --> 05:10.560
direction, and the second-order partial derivative with respect to the
05:10.560 --> 05:11.860
vertical direction twice.
05:12.320 --> 05:17.140
So indeed, we would need three filter masks to implement these three
05:17.140 --> 05:18.800
second-order partial derivatives.
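One common discretization of these three masks (the slides may use scaled variants) can be checked on a function with a known Hessian, for instance g(u, v) = u·v, whose mixed derivative is 1 everywhere and whose pure second derivatives vanish:

```python
import numpy as np

# Common finite-difference masks for the three second-order partial
# derivatives (one possible discretization, not necessarily the slides')
d_uu = np.array([[1., -2., 1.]])              # twice w.r.t. horizontal u
d_vv = d_uu.T                                  # twice w.r.t. vertical v
d_uv = 0.25 * np.array([[ 1., 0., -1.],
                        [ 0., 0.,  0.],
                        [-1., 0.,  1.]])       # mixed derivative

# Sanity check on g(u, v) = u * v: g_uu = 0, g_vv = 0, g_uv = 1
coords = np.arange(5, dtype=float)
g = np.outer(coords, coords)                   # g[v, u] = v * u
guu = np.sum(g[2, 1:4] * d_uu[0])              # row through the center
gvv = np.sum(g[1:4, 2] * d_vv[:, 0])           # column through the center
guv = np.sum(g[1:4, 1:4] * d_uv)               # 3x3 patch at the center
```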
05:19.660 --> 05:24.400
And we can derive them; filter masks that are appropriate are shown
05:24.400 --> 05:25.660
here on the right-hand side.
05:26.180 --> 05:31.280
So if we want to calculate the full Hessian of the gray level
05:31.280 --> 05:33.720
function, we can take these filter masks.
05:34.160 --> 05:38.820
However, often we are not interested in the full second-order
05:38.820 --> 05:42.460
derivative, but in something that is called the Laplace operator,
05:42.900 --> 05:48.220
which is just the sum of the two second-order partial derivatives.
05:48.580 --> 05:52.180
The first one that is taken twice with respect to the horizontal
05:52.180 --> 05:58.580
coordinate axis u, and the other one that is taken twice with respect
05:58.580 --> 06:02.080
to the second coordinate axis v.
06:02.960 --> 06:06.300
This Laplace operator has some nice properties.
06:07.220 --> 06:13.040
It's invariant with respect to the rotation of the image, that's one
06:13.040 --> 06:18.980
of the properties, and it provides us somehow, say, a scalar number
06:18.980 --> 06:26.060
that tells us whether the second-order derivative is somehow strong or
06:26.060 --> 06:26.980
not that strong.
06:27.600 --> 06:31.580
The information that we lose, especially this information of the mixed
06:31.580 --> 06:37.240
term, tells us something about the orientation of gray level edges.
06:37.380 --> 06:41.520
But if you are not that much interested in this orientation, we might
06:41.520 --> 06:46.660
end up with a Laplace operator that provides us something about the
06:46.660 --> 06:49.400
magnitude, so to say, of the second-order derivative.
06:50.360 --> 06:53.180
And for that, there is also a filter mask.
06:53.620 --> 06:59.900
This filter mask here, for instance, that is just the addition of this
06:59.900 --> 07:02.540
filter mask and this filter mask here.
07:02.740 --> 07:06.320
And what we get then is this Laplace filter mask.
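The addition of the two masks just mentioned can be written out directly; on a linear gray value ramp the response is zero, as expected for a second derivative (toy values for illustration):

```python
import numpy as np

# Embed the two 1-D second-derivative masks into 3x3 masks and add them
d_uu = np.array([[0., 0., 0.],
                 [1., -2., 1.],
                 [0., 0., 0.]])
d_vv = d_uu.T
laplace = d_uu + d_vv   # the well-known [[0,1,0],[1,-4,1],[0,1,0]] mask

# On a linear gray value ramp the second derivative vanishes
u = np.arange(5, dtype=float)
ramp = np.add.outer(2.0 * u, 3.0 * u)          # g(v, u) = 2v + 3u
response = np.sum(ramp[1:4, 1:4] * laplace)    # response at the center pixel
```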
07:07.660 --> 07:13.580
Sometimes we prefer to combine the Laplace operator with the Gaussian
07:14.260 --> 07:15.180
filter mask.
07:15.840 --> 07:21.140
The problem with taking the derivative is that derivatives are very
07:22.140 --> 07:27.060
sensitive to noise in the images, and that means if you just take a
07:27.060 --> 07:33.400
raw image and calculate the Laplace operator on the raw image, then
07:33.400 --> 07:36.520
the result is highly affected by noise in the image.
07:36.860 --> 07:41.100
If you want to reduce this influence, we can combine the Laplace
07:41.100 --> 07:45.580
operator with a Gaussian filter mask, and then what we get is called a
07:45.580 --> 07:49.760
Laplacian of Gaussian, which can be understood as first we apply the
07:49.760 --> 07:54.180
Laplace operator on the image and afterwards a Gaussian filter.
07:54.500 --> 07:56.640
Or vice versa, the order doesn't matter.
07:57.100 --> 08:01.560
Or we first combine the Laplacian with a Gaussian, and we get a larger
08:01.560 --> 08:05.780
filter mask, which is called an LOG, Laplacian of Gaussian, and then
08:05.780 --> 08:09.640
we filter the original image with this Laplacian of Gaussian filter
08:09.640 --> 08:10.040
mask.
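Combining the two masks into one LoG mask can be sketched like this; the 5×5 Gaussian with sigma 1 is an arbitrary example choice, a real implementation would pick the mask size depending on sigma:

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 2-D Gaussian mask (size and sigma are example choices)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def conv2_full(a, b):
    """Full 2-D convolution of two small masks."""
    ah, aw = a.shape
    bh, bw = b.shape
    out = np.zeros((ah + bh - 1, aw + bw - 1))
    for i in range(ah):
        for j in range(aw):
            out[i:i + bh, j:j + bw] += a[i, j] * b
    return out

laplace = np.array([[0., 1., 0.],
                    [1., -4., 1.],
                    [0., 1., 0.]])
# One larger LoG mask: the image then needs to be filtered only once
log_mask = conv2_full(gaussian_kernel(5, 1.0), laplace)
```

Since the Laplace mask sums to zero, the combined LoG mask also sums to zero, so it still gives no response in homogeneous areas.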
08:11.300 --> 08:14.960
So, the result of such an operation can be seen here.
08:15.080 --> 08:21.300
Again, the result is color-coded, so please mind that here zero result
08:21.300 --> 08:23.100
is encoded with yellow.
08:23.800 --> 08:27.300
And what we can see, the second order derivative, of course, is zero
08:27.300 --> 08:31.500
in all the homogeneous areas, say in the sky, for instance, or here on
08:31.500 --> 08:31.960
the ground.
08:32.460 --> 08:36.320
But again, in those areas where we have strong contrast, local
08:36.320 --> 08:42.140
contrast between bright and dark areas, we get an output of this
08:42.140 --> 08:44.360
Laplacian filter mask, which is non-zero.
08:44.820 --> 08:49.020
But actually, if we look very carefully, we will see that close to
08:49.020 --> 08:53.540
these gray level edges, there is not only one red line, which would
08:53.540 --> 08:59.040
indicate large positive result of the LOG, but there is also a green
08:59.040 --> 09:03.920
line, which indicates a negative response of more or less the same
09:03.920 --> 09:04.920
magnitude.
09:06.300 --> 09:08.300
And that is very typical.
09:08.480 --> 09:12.060
So, what we can find in the second order derivative, if we calculate
09:12.060 --> 09:18.680
it in the Laplacian-of-Gaussian filtered image, is that at a gray level
09:18.680 --> 09:23.960
edge, we get two peaks, so to say, of the LOG filter, and a zero
09:23.960 --> 09:25.600
crossing in between.
09:26.300 --> 09:30.080
And then the zero crossing is actually the position of the gray level
09:30.080 --> 09:30.440
edge.
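The double peak with a zero crossing in between can be reproduced on a 1-D step edge; this toy signal uses a plain second-derivative mask as a stand-in for the LOG:

```python
import numpy as np

# A smoothed 1-D step edge from dark (0) to bright (1)
signal = np.array([0., 0., 0., 0.2, 0.8, 1., 1., 1.])

# Second derivative via the 1-D mask [1, -2, 1]
second = np.convolve(signal, [1., -2., 1.], mode='valid')

# A positive peak on the dark side, a negative peak on the bright side,
# and a zero crossing in between, right at the edge position
crossings = np.where(second[:-1] * second[1:] < 0)[0]
```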
09:32.500 --> 09:41.640
Okay, now that we have started with these derivative filters, we can
09:41.640 --> 09:45.720
continue and ask how can we use them for this edge detection.
09:46.340 --> 09:51.980
So, say we have a gray level image, and we want to get the boundaries
09:51.980 --> 10:00.780
of areas which are bright on a dark background, or vice versa, which
10:00.780 --> 10:02.860
are dark on a bright background.
10:05.660 --> 10:10.800
And for this purpose, there are several algorithms which have been
10:10.800 --> 10:12.980
invented in computer vision.
10:14.540 --> 10:20.580
One of those is the so-called Canny algorithm or Canny operator, and it
10:20.580 --> 10:22.940
uses the
10:26.400 --> 10:30.140
derivative filters in order to detect the edge pixels.
10:30.300 --> 10:30.940
What does it do?
10:30.940 --> 10:37.220
Well, the first step is filtering the input image
10:37.220 --> 10:42.120
with a Gaussian filter mask
10:44.360 --> 10:50.000
to get a little bit
10:50.000 --> 10:52.420
rid of noise in the image.
10:52.860 --> 10:59.540
The second step is using the Sobel filters in order to calculate the
10:59.540 --> 11:01.760
derivatives, the first order derivatives.
11:02.460 --> 11:03.860
And then, what do we do?
11:03.960 --> 11:08.400
We want to find the maxima, the local maxima, in the first order
11:08.400 --> 11:09.240
derivatives.
11:09.700 --> 11:10.880
How is it done?
11:11.360 --> 11:16.180
Well, the first thing is we go through the whole image, we check for
11:16.180 --> 11:21.240
each pixel, do we find neighboring pixels with a stronger response of
11:21.240 --> 11:23.140
the derivative filters?
11:23.140 --> 11:31.880
If yes, then we know this is not a local maximum of the gradient
11:31.880 --> 11:33.040
image.
11:33.820 --> 11:39.400
And if no, then we say, okay, this pixel seems to be a candidate for
11:39.400 --> 11:40.060
an edge pixel.
11:40.780 --> 11:46.420
And then the last step is that we do some thresholding, a little bit
11:46.420 --> 11:51.980
more complicated thresholding process in order to filter out noise and
11:51.980 --> 11:55.860
only get structures which are really salient, which are really salient
11:55.860 --> 11:56.500
edge pixels.
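A heavily simplified sketch of these steps: Sobel gradients, a plain 3×3 local-maximum test, and a single threshold. The real Canny operator uses directional non-maximum suppression and hysteresis thresholding with two thresholds, and the Gaussian smoothing step is omitted here for brevity:

```python
import numpy as np

def simple_edge_candidates(img, threshold):
    """Simplified sketch of the described steps, NOT the full Canny:
    Sobel gradients, a 3x3 local-maximum test, and one threshold."""
    sob = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])
    h, w = img.shape
    gu = np.zeros((h, w))
    gv = np.zeros((h, w))
    for r in range(1, h - 1):                 # skip the image border
        for c in range(1, w - 1):
            patch = img[r - 1:r + 2, c - 1:c + 2]
            gu[r, c] = np.sum(patch * sob)    # horizontal derivative
            gv[r, c] = np.sum(patch * sob.T)  # vertical derivative
    mag = np.hypot(gu, gv)                    # gradient magnitude

    edges = np.zeros((h, w), dtype=bool)
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            # candidate only if no neighboring pixel responds more strongly
            if mag[r, c] >= mag[r - 1:r + 2, c - 1:c + 2].max() \
                    and mag[r, c] > threshold:
                edges[r, c] = True
    return edges

# A toy image with a vertical gray level edge between columns 3 and 4
img = np.zeros((7, 7))
img[:, 4:] = 1.0
edges = simple_edge_candidates(img, threshold=1.0)
```

Without the directional non-maximum suppression, this sketch keeps the edge two pixels wide; real Canny thins it to a single pixel.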
11:57.680 --> 12:00.440
So, just to give you this basic idea.
12:01.140 --> 12:07.140
So, now, if we take a picture, say, of the main building of KIT, next
12:07.140 --> 12:11.540
to the Kaiserstrasse, for instance, and apply the Canny operator, then
12:11.540 --> 12:16.780
we get a binary image, so an image where the pixels are only zero and
12:16.780 --> 12:20.580
one, illustrated here with white and black.
12:21.320 --> 12:26.720
And in this illustration, the black pixels are those which are found
12:26.720 --> 12:32.580
to be edge pixels, gray level edge pixels, while the white pixels in
12:32.580 --> 12:37.760
this illustration are those which are not edge pixels according to
12:37.760 --> 12:39.480
what the Canny operator yields.
12:41.040 --> 12:46.780
To be honest, to get this illustration here on the slide so that you
12:46.780 --> 12:51.880
can see something, I have widened all the black lines a little bit,
12:52.020 --> 13:01.600
otherwise the video projector wouldn't have been able to show all
13:01.600 --> 13:02.360
these lines.
13:02.900 --> 13:07.000
So, thin lines disappear often if you use a video projector.
13:08.240 --> 13:09.540
Okay, so what can we see?
13:09.760 --> 13:14.580
Which are the pixels where we see these gray level edges?
13:14.960 --> 13:19.140
So, if you look here, we see the structure of the building, the
13:19.140 --> 13:23.100
boundary of the building here, where we have some contrast between
13:23.100 --> 13:24.940
dark and bright.
13:25.940 --> 13:33.100
Here in these areas where we have shadows, where some surfaces of the
13:33.100 --> 13:38.140
building are in the shadow while others are illuminated by the sun, we
13:38.140 --> 13:45.020
get strong contrast and this yields these gray level edges here.
13:47.500 --> 13:51.840
Then, for instance, here in the windows we have structures which can
13:51.840 --> 13:52.760
be found again.
13:53.460 --> 13:57.420
Some areas between bricks, where some bricks have a little bit
13:57.420 --> 14:01.340
different shade, create some structures here.
14:01.820 --> 14:07.580
Then here, the contrast between the road and the curbstone creates a
14:07.580 --> 14:08.480
gray level edge.
14:09.120 --> 14:13.680
And the contrast between the curbstone and the pedestrian area also,
14:13.900 --> 14:19.100
at least sometimes, at some points, creates a gray level edge like
14:19.100 --> 14:19.300
that.
14:19.440 --> 14:23.300
So, what we can see is when we extract these gray level edges, they
14:23.300 --> 14:28.680
reveal some structure, some geometric information about the scene that
14:28.680 --> 14:29.420
we are observing.
14:30.360 --> 14:31.980
Okay, that's the Canny operator.
14:32.580 --> 14:36.000
That's one possibility to calculate gray level edges.
14:36.580 --> 14:39.860
Another one is the approach according to Marr and Hildreth.
14:41.140 --> 14:42.620
It's just an alternative.
14:43.000 --> 14:45.620
It's not better, it's not worse, it's just an alternative.
14:46.040 --> 14:50.540
So, there might be different reasons for choosing the one or the other
14:50.540 --> 14:50.740
one.
14:50.840 --> 14:53.960
What is the idea of the Marr-Hildreth approach?
14:54.340 --> 15:00.620
The idea is a pixel is an edge pixel if the first order derivative is
15:00.620 --> 15:09.360
large, which means we have a strong contrast between dark and bright,
15:09.700 --> 15:14.000
and the second order derivative has a zero crossing.
15:15.420 --> 15:19.660
So, what that means, what we are searching here is we're searching
15:19.660 --> 15:25.160
pixels with a strong, with a large gradient, and at the same time
15:25.160 --> 15:31.020
which are zero crossings of the second order derivative, or better to
15:31.020 --> 15:33.540
say of the Laplace filtered image.
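This criterion can be illustrated in one dimension; the toy signal and the gradient threshold of 0.2 are arbitrary example choices:

```python
import numpy as np

# A smoothed 1-D step edge from dark to bright (toy values)
s = np.array([0., 0., 0.1, 0.5, 0.8, 1., 1.])

d1 = np.convolve(s, [0.5, 0., -0.5], mode='valid')  # first derivative
d2 = np.convolve(s, [1., -2., 1.], mode='valid')    # second derivative

# Edge where the second derivative changes sign AND the first derivative
# is large (0.2 is an arbitrary example threshold)
edges = [i for i in range(len(d2) - 1)
         if d2[i] * d2[i + 1] < 0 and abs(d1[i]) > 0.2]
```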
15:34.540 --> 15:37.760
Now, if we do that, and afterwards we do some thresholding on that,
15:38.140 --> 15:42.640
and if we do that, apply it onto the same image, then we get this
15:42.640 --> 15:45.180
result here, which we can see on the right hand side.
15:45.500 --> 15:48.800
Again, the structures of the building are revealed.
15:49.340 --> 15:54.420
So, you might wonder why it isn't as nice as the image on the slide
15:54.420 --> 15:54.700
before?
15:55.340 --> 15:59.940
The answer is that the implementation
15:59.940 --> 16:00.840
I was using
16:03.620 --> 16:07.780
was not as good as the implementation for the Canny operator,
16:08.000 --> 16:08.820
to say it
16:08.820 --> 16:09.860
like that.
16:10.520 --> 16:16.460
So, the only reason why some edges are missing in this area here is
16:16.460 --> 16:19.960
because the thresholding process that was implemented for this
16:19.960 --> 16:25.900
Marr-Hildreth approach was not as elaborate as for the Canny
16:25.900 --> 16:26.700
implementation.
16:27.180 --> 16:31.280
But indeed, both approaches are more or less equivalent.
16:31.980 --> 16:39.180
They would yield results of the same quality if we used implementations
16:39.180 --> 16:40.520
of the same quality, say.
16:41.820 --> 16:42.440
Okay.
16:42.720 --> 16:50.020
So, that's the way we can get these gray level edges.
16:51.900 --> 16:55.620
So, now let's change the topic a little bit.
16:55.760 --> 17:00.540
So, we talked about basic filtering techniques up to now.
17:01.280 --> 17:05.940
Second basic knowledge from computer vision that we need for the
17:05.940 --> 17:11.180
lecture is how are points in the world mapped onto the image?
17:11.340 --> 17:15.420
So, which is a geometric relationship between a point in the world and
17:15.420 --> 17:16.700
a point in the image?
17:17.020 --> 17:19.260
How can we describe this mapping?
17:19.900 --> 17:24.300
And then there's one rather simple model, but which is quite powerful
17:24.300 --> 17:27.700
and useful for most applications in computer vision.
17:28.180 --> 17:29.700
That's the pinhole camera model.
17:30.320 --> 17:32.680
And that is something that we will introduce here.
17:33.120 --> 17:36.040
So, the idea is, yeah, we have a camera here.
17:36.360 --> 17:40.260
We have some three-dimensional world where we can describe the
17:40.260 --> 17:45.880
geometry of the scene in a coordinate system, which I will name a
17:45.880 --> 17:49.140
world coordinate system because it describes the world.
17:49.420 --> 17:54.020
And I will use throughout this lecture these Greek letters xi, eta,
17:54.340 --> 17:59.700
and zeta in order to refer to coordinates in this world coordinate
17:59.700 --> 18:00.200
system.
18:00.500 --> 18:05.640
That means every time you find these variables, these Greek letters
18:05.640 --> 18:10.780
here, xi, eta, and zeta, you know that this coordinate is a coordinate
18:10.780 --> 18:13.560
represented in this world coordinate system.
18:13.760 --> 18:19.080
Now, this coordinate system is fixed somewhere in the world, at
18:19.080 --> 18:20.280
a certain position.
18:20.660 --> 18:25.180
So, if I would like to define a world coordinate system for this
18:25.180 --> 18:29.600
lecture hall here, I would think about where should the origin be and
18:29.600 --> 18:34.400
how should the coordinate axis point in which directions, and then it
18:34.400 --> 18:35.860
would be fixed at that position.
18:36.480 --> 18:40.180
In our applications in driving, often the world coordinate system is
18:40.180 --> 18:45.380
fixed to the ego car with which we are driving, from which we are
18:45.380 --> 18:46.320
observing the world.
18:46.740 --> 18:52.480
Then we might say, okay, the origin is at a specific position inside
18:52.480 --> 18:56.240
of the ego vehicle, and the three coordinate axes are pointing to
18:56.240 --> 19:00.520
certain directions, say forward, sidewards, and upwards, or something
19:00.520 --> 19:01.080
like that.
19:02.540 --> 19:08.140
Okay, and now the scene is recorded with a camera, and then we see the
19:08.140 --> 19:09.420
image that we get.
19:09.880 --> 19:14.720
And there we have image coordinates, and I will use the letters u and
19:14.720 --> 19:17.240
v to refer to image coordinates.
19:17.460 --> 19:21.060
So, u and v always refer to coordinates in the image, u is the
19:21.060 --> 19:24.220
horizontal coordinate, and v the vertical one.
19:25.080 --> 19:30.940
And now the question is, how, in which way can we describe the mapping
19:30.940 --> 19:34.680
of points from here to here, and from here to here, and from here to
19:34.680 --> 19:34.840
here?
19:37.000 --> 19:44.680
The idea is to simplify the setup of a real camera, and to introduce
19:44.680 --> 19:48.580
this pinhole camera model, in which we say the camera is nothing else
19:48.580 --> 19:53.660
than a plane called the image plane, on which the light-sensitive
19:53.660 --> 20:05.880
chip, the imager, is mounted in order to transform the analog light signal
20:05.880 --> 20:07.160
into a digital signal.
20:08.720 --> 20:12.300
And in front of this image plane, we have a perforated sheet.
20:13.700 --> 20:18.440
That means there is a sheet that doesn't let light through, except of
20:18.440 --> 20:21.540
a single point here, at which light comes through.
20:21.700 --> 20:28.220
This point is large enough that we don't have trouble with diffraction
20:28.220 --> 20:36.780
phenomena, that is, light bending phenomena.
20:38.220 --> 20:43.700
But it is small enough so that we can simplify the whole mapping
20:43.700 --> 20:49.780
between points to straight lines along which the light travels.
20:50.060 --> 20:57.220
That means a point from here is mapped to this point through a line
20:57.220 --> 20:59.980
that passes through this little hole here.
21:01.720 --> 21:06.000
Now we need to introduce a third coordinate system in order to
21:06.000 --> 21:07.640
describe this mapping.
21:08.100 --> 21:13.120
That's the so-called camera coordinate system, and the letters that
21:13.120 --> 21:16.460
are used to refer to coordinates in the camera coordinate system are
21:16.460 --> 21:18.000
x, y, and z.
21:18.400 --> 21:22.040
And this camera coordinate system is defined in such a way that its
21:22.040 --> 21:27.260
origin is inside of this small hole, which we later on will call the
21:27.260 --> 21:28.640
focal point of the camera.
21:29.680 --> 21:31.680
So that's the origin.
21:32.520 --> 21:37.440
Then the z-axis of this coordinate system is always orthogonal to the
21:37.440 --> 21:38.220
image plane.
21:40.200 --> 21:46.480
And the x and y coordinate axes are more or less arbitrary, with the
21:46.480 --> 21:50.640
only restriction that the x, y, z coordinate system should be an
21:50.640 --> 21:53.500
orthogonal right-handed coordinate system.
21:54.400 --> 21:59.700
So whether x points to the left or upward or downward doesn't matter
21:59.700 --> 22:01.500
that much later on.
22:01.500 --> 22:05.920
We will fix it in such a way that the x-axis of the
22:05.920 --> 22:10.460
camera coordinate system is parallel to the u coordinate axis of the
22:10.460 --> 22:10.860
image.
22:11.400 --> 22:15.340
By that, later on, this coordinate system is defined.
22:16.400 --> 22:20.800
Okay, now if we represent all points in this x, y, z coordinate
22:20.800 --> 22:25.440
system, then we can use the intercept theorem to describe this
22:25.440 --> 22:30.240
mapping. This axis here, which is called the optical
22:30.240 --> 22:34.120
axis, passes through this focal point and is parallel to the z-axis
22:34.120 --> 22:41.560
of the coordinate system. Then it holds that
22:41.560 --> 22:48.040
the lateral distance between
22:48.040 --> 22:57.580
the point in the world and this optical axis, over the distance along
22:57.580 --> 23:01.740
the optical axis between this point in the world and the focal point
23:01.740 --> 23:06.920
of the camera, is equal to the lateral deviation of the image point in the
23:06.920 --> 23:13.020
image plane from the optical axis, over the
23:13.020 --> 23:18.400
distance between this pinhole, the focal point, and the image plane.
23:18.400 --> 23:23.460
Yeah, that's just the intercept theorem, and based on that we can
23:23.460 --> 23:30.020
describe this relationship in a simple way.
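Written out in symbols, with (x, y, z) the point in camera coordinates, (x', y') its projection on the image plane, and b the distance between the focal point and the image plane (the symbol b is an assumption here; many texts call this the focal length f), the intercept theorem reads:

```latex
\frac{x'}{b} = \frac{x}{z}, \qquad
\frac{y'}{b} = \frac{y}{z}
\quad\Longrightarrow\quad
x' = b\,\frac{x}{z}, \qquad
y' = b\,\frac{y}{z}
```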
23:30.520 --> 23:34.220
The second step that we need, once we are on the image plane, so this
23:34.220 --> 23:39.900
is a picture of the image plane, then we have the projection of the
23:39.900 --> 23:42.860
camera coordinate system onto the image plane.
23:43.040 --> 23:47.160
This yields these x-prime and y-prime coordinates, and we have the
23:47.160 --> 23:59.340
imager that is somehow part of the image plane, shown here in purple
23:59.340 --> 24:05.720
or in violet, and we have the coordinate system of the imager, the UV
24:05.720 --> 24:06.780
coordinate system.
24:06.940 --> 24:12.960
As I said, we usually define the camera coordinate system in such a
24:12.960 --> 24:17.240
way that this x-prime direction is parallel to the u-direction, and
24:17.240 --> 24:22.220
then the second step of describing this transformation between world
24:22.220 --> 24:25.880
coordinates and image coordinates is to describe the transformation
24:25.880 --> 24:30.160
between the points that we have represented in the camera coordinate
24:30.160 --> 24:35.560
system, the image points represented in camera coordinate systems, and
24:35.560 --> 24:41.120
then the same point represented in image coordinates in this UV
24:41.120 --> 24:42.120
coordinate system.
24:42.500 --> 24:47.480
And we see the origins are shifted with respect to each other, and the
24:47.480 --> 24:52.420
shift of the origins is known as the principal point.
24:52.540 --> 24:56.380
So the principal point is the shift between the origins of the image
24:56.380 --> 25:02.800
coordinate system and the camera coordinate system, and the length
25:02.800 --> 25:06.840
unit that we use in the camera coordinate system and in the image
25:06.840 --> 25:08.420
coordinate systems are different.
25:08.700 --> 25:12.420
So in the image coordinate system we have a length unit of one pixel
25:12.420 --> 25:19.800
length, while in the camera coordinate system we want to use physically
25:19.800 --> 25:25.220
meaningful length units, meters, inches, centimeters, millimeters,
25:25.500 --> 25:26.300
whatever you like.
25:27.460 --> 25:30.740
And therefore we have a scaling between the two coordinate systems.
25:31.380 --> 25:36.160
So finally, the transformation between the two
25:36.160 --> 25:39.380
coordinate systems can be represented with this formula.
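The formula itself is not reproduced in the transcript; a common form, assuming pixel sizes s_u and s_v as the scaling factors and (u_0, v_0) as the principal point, and up to the sign conventions of the coordinate axes, is:

```latex
u = \frac{x'}{s_u} + u_0, \qquad
v = \frac{y'}{s_v} + v_0
```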
25:39.840 --> 25:45.280
So, if we combine
25:45.680 --> 25:49.720
these two steps that we introduced up to now,
25:50.160 --> 25:54.220
we get a mapping that can be represented like that.
25:54.300 --> 25:59.240
Without showing you all details how it is derived, the result looks
25:59.240 --> 25:59.700
like that.
25:59.900 --> 26:06.220
So we can take a point in the world that is represented in the
26:06.220 --> 26:11.020
camera coordinate system, x, y, z, then multiply it with this matrix.
26:11.540 --> 26:17.300
There are some entries here which describe the optical properties of
26:17.300 --> 26:17.820
the camera.
26:18.800 --> 26:24.520
Yeah, assume that we know these numbers that are shown here.
26:25.860 --> 26:31.260
And then what we get is the image coordinate uv with an additional
26:31.260 --> 26:34.020
third row one, but not exactly that.
26:34.280 --> 26:42.020
But we get this vector up to some scaling factor z, the distance of a
26:42.020 --> 26:43.700
point from the camera.
26:44.200 --> 26:48.520
So now if we want to calculate the image position u and v, what can we
26:48.520 --> 26:48.760
do?
26:49.260 --> 26:54.260
We take this result of what we get as output when we apply this
26:54.260 --> 26:55.860
calculation on the right hand side.
26:56.520 --> 27:00.620
Then we know that the third coordinate of the vector that we get is
27:00.620 --> 27:01.980
equal to z times one.
27:02.680 --> 27:05.440
And the first entry is equal to z times u.
27:05.760 --> 27:10.340
So if we divide the first entry by the third entry, we divide z times
27:10.340 --> 27:12.320
u over z times one.
27:12.920 --> 27:17.960
So z cancels and what remains is u.
27:18.940 --> 27:22.900
And if you want to calculate v, the image coordinate in vertical
27:22.900 --> 27:27.120
direction, we know the second entry is z times v, the third is z times
27:27.120 --> 27:27.420
one.
27:27.760 --> 27:32.680
We calculate z times v over z times one, and what we get is v, the
27:32.680 --> 27:36.320
vertical position of the point in the image.
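As a small numerical sketch of this projection (the entries of A are made-up example values, not the ones from the slide):

```python
import numpy as np

# A made-up example intrinsic matrix A (alpha' = beta' = 800 pixels,
# principal point (u0, v0) = (320, 240), no skew, i.e. theta = 90 degrees)
A = np.array([[800.,   0., 320.],
              [  0., 800., 240.],
              [  0.,   0.,   1.]])

p_cam = np.array([0.5, -0.25, 2.0])  # a point (x, y, z) in camera coordinates
zuv1 = A @ p_cam                     # equals z * (u, v, 1)

u = zuv1[0] / zuv1[2]                # dividing by the third entry cancels z
v = zuv1[1] / zuv1[2]
```

Dividing the first and second entries by the third cancels the unknown scale z, exactly as described.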
27:37.520 --> 27:46.180
So once we know this matrix A here, we easily can calculate in which
27:46.180 --> 27:51.380
way points which are given as three-dimensional points in the camera
27:51.380 --> 27:56.860
coordinate system are mapped into the image.
27:57.500 --> 28:00.340
We can easily calculate the u and v coordinate.
28:00.780 --> 28:05.560
In the other direction, unfortunately, it's not that easy, because we
28:05.560 --> 28:10.420
only know two numbers, u and v, and what we aim
28:05.560 --> 28:10.420
at is getting three numbers, x, y, and z.
28:14.500 --> 28:22.220
This matrix A is always a full-rank matrix, it's
28:22.220 --> 28:26.540
invertible, so we can invert it, and then we get x,
28:26.600 --> 28:35.840
y, z equal to A inverse times z times the vector (u, v, 1). But you see, this
28:35.840 --> 28:40.440
value z, that's the distance of a point from the camera, is something
28:40.440 --> 28:45.140
that we want to calculate, but in order to calculate it, we need to
28:45.140 --> 28:46.800
know what its value is.
28:47.000 --> 28:49.520
So z is something that we can't calculate.
28:49.520 --> 28:57.740
We can only determine x and y depending on z, because if we only have
28:57.740 --> 29:02.440
an image, we do not know at which position a point in the three
29:02.440 --> 29:04.180
-dimensional world is that we see.
29:04.360 --> 29:10.900
We can calculate a line of sight, a direction from which this point is
29:10.900 --> 29:15.720
located, but we cannot determine how far from the camera it is.
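A sketch of this back-projection, using a made-up example intrinsic matrix A; the chosen depth z = 2 is an arbitrary assumption, precisely because z cannot be recovered from the image alone:

```python
import numpy as np

# Made-up example intrinsic matrix (not the lecture's actual values)
A = np.array([[800.,   0., 320.],
              [  0., 800., 240.],
              [  0.,   0.,   1.]])
A_inv = np.linalg.inv(A)

u, v = 520.0, 140.0                       # an observed pixel position
ray = A_inv @ np.array([u, v, 1.0])       # direction of the line of sight
# Every point z * ray (z > 0) projects onto the same pixel (u, v);
# the depth z itself cannot be recovered from a single image.
p_if_z_is_2 = 2.0 * ray                   # the 3-D point IF z = 2 (assumed)
```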
29:17.860 --> 29:22.820
So, yeah, this matrix A contains these parameters that are shown here,
29:22.940 --> 29:29.800
alpha prime, beta prime are actually scaling factors, u0, v0 are the
29:29.800 --> 29:34.020
shift between the origins of the image and
29:29.800 --> 29:34.020
camera coordinate systems, so it's the position of the principal point,
29:37.700 --> 29:42.720
and the theta is some skewing angle that is not that interesting for
29:42.720 --> 29:46.260
us, because for most cameras it's 90 degree.
29:47.580 --> 29:51.240
These parameters which occur inside of this matrix are called
29:51.240 --> 29:57.080
intrinsic parameters, and they describe the actual optical properties
29:57.080 --> 29:57.940
of the camera.
30:00.280 --> 30:05.600
And if you take a camera and you zoom in a scene or zoom out of a
30:05.600 --> 30:11.440
scene, then this matrix is changing, the intrinsic parameters are
30:11.440 --> 30:11.840
changing.
30:13.480 --> 30:23.240
So, now, I started with saying that typically you want to represent
30:23.240 --> 30:27.440
points in the world not in the camera coordinate system, but because
30:27.440 --> 30:33.140
this camera coordinate system is hard to determine, in practice it's
30:33.140 --> 30:37.300
almost impossible to determine exactly where the origin of the camera
30:37.300 --> 30:38.800
is, where the focal point is.
30:41.060 --> 30:45.620
So, it's better to represent things in the world in another coordinate
30:45.620 --> 30:51.000
system, in a world coordinate system that we can define based on the
30:51.000 --> 30:53.760
needs of the application in which we are interested.
30:54.460 --> 30:58.120
So, let's assume we have something like that, so here's our world
30:58.120 --> 31:01.820
coordinate system, and here's the camera with the camera coordinate
31:01.820 --> 31:02.360
system.
31:02.900 --> 31:07.680
Now, we need another transformation which enables us to transform
31:07.680 --> 31:11.220
points, which are three-dimensional points in the world, which are
31:11.220 --> 31:17.680
represented in the world coordinate system into a representation in
31:17.680 --> 31:19.380
the camera coordinate system.
31:19.520 --> 31:23.080
So, the same point, but represented in different coordinate systems.
31:23.580 --> 31:27.700
For both coordinate systems, we assume they are right-handed
31:27.700 --> 31:31.860
orthogonal coordinate systems, and we assume that they use the same
31:31.860 --> 31:32.780
length unit.
31:33.180 --> 31:36.940
So, if we calculate in the world coordinate system with meters, we
31:36.940 --> 31:43.000
also calculate in meters when we are using the camera coordinate
31:43.000 --> 31:43.500
system.
31:43.880 --> 31:47.800
So, we use the same length unit, and that means we don't have to
31:47.800 --> 31:52.100
consider scaling between the two coordinate systems, but only the
31:52.100 --> 31:56.040
shift of the origin and the rotation between the two coordinate
31:56.040 --> 31:56.580
systems.
31:58.080 --> 32:02.100
And to do that, we have to introduce two additional parameters which
32:02.100 --> 32:05.820
are known as the so-called extrinsic parameters of a camera.
32:06.260 --> 32:09.900
They describe the position of the camera and the orientation of the
32:09.900 --> 32:10.260
camera.
32:12.200 --> 32:17.120
And by that, they describe in which way the coordinates between the
32:17.120 --> 32:21.320
image, the, sorry, the world coordinate system and the camera
32:21.320 --> 32:24.040
coordinate system have to be transformed.
32:24.820 --> 32:31.560
So, these two parameters are T, a vector T, that's a vector with three
32:31.560 --> 32:36.080
entries that describes the position of the origin of the world
32:36.080 --> 32:39.960
coordinate system with respect to the camera coordinate system.
32:40.680 --> 32:46.600
So, it describes where the origin of the world coordinate system is
32:46.600 --> 32:49.180
when we want to represent it in camera coordinates.
32:49.960 --> 32:54.660
And then there is this rotation matrix R that describes the rotation
32:54.660 --> 32:59.960
between the two coordinate systems, that is, how much we have to
32:59.960 --> 33:04.540
rotate the two coordinate systems to align them.
33:06.420 --> 33:10.440
So, these are called the extrinsic camera parameters.
33:11.160 --> 33:17.500
And if we also add this additional transformation to our modeling,
33:17.860 --> 33:21.900
then we end up with this formula here that is given here on the slide.
33:22.300 --> 33:25.720
So, now we start with Xi, Eta, Zeta.
33:26.100 --> 33:30.220
So, the coordinates of a point, of a three-dimensional point in the
33:30.220 --> 33:35.020
world represented in the world coordinate system, we first apply a
33:35.020 --> 33:37.400
rotation and a translation on it.
33:37.720 --> 33:42.280
So, this is actually a three by four matrix here where the first three
33:42.280 --> 33:48.420
columns constitute the rotation matrix R and the fourth column the
33:48.420 --> 33:49.700
translation vector T.
33:50.820 --> 33:57.820
So, now if we multiply this three by four matrix with this vector Xi,
33:57.900 --> 34:05.740
Eta, Zeta, one, what we actually do is we first
34:05.740 --> 34:10.940
apply the rotation to Xi, Eta, Zeta and afterwards add the offset T
34:10.940 --> 34:11.780
to the result.
34:12.540 --> 34:14.460
So, that is what is shown here.
34:14.460 --> 34:20.140
And then we end up with the representation in the camera coordinate
34:20.140 --> 34:20.680
system.
34:22.160 --> 34:27.280
So, the parameters, as I said, are this translation vector or shift
34:27.280 --> 34:30.660
between the two coordinate systems T and the rotation between the two
34:30.660 --> 34:31.920
coordinate systems R.
34:33.300 --> 34:37.380
And indeed, these are six degrees of freedom.
34:37.580 --> 34:45.640
So, three degrees of freedom in the translation vector and three in the rotation matrix.
34:45.780 --> 34:50.340
Three angles, one angle with respect to each coordinate axis.
34:52.120 --> 34:56.380
Yeah, and that's the transformation from world coordinates to camera
34:56.380 --> 34:56.960
coordinates.
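A quick numerical check of this world-to-camera step; R and T below are assumed example values, not parameters from the lecture:

```python
import numpy as np

# Illustrative extrinsic parameters: rotate 90 degrees about the z axis
# and shift the origin by T (both values are assumptions).
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
T = np.array([1.0, 2.0, 3.0])

RT = np.hstack([R, T[:, None]])      # the 3x4 matrix [R | T]

p_world = np.array([4.0, 5.0, 6.0])  # (xi, eta, zeta)
p_hom = np.append(p_world, 1.0)      # (xi, eta, zeta, 1)

# Multiplying by the 3x4 matrix is the same as rotating first, then adding T:
print(np.allclose(RT @ p_hom, R @ p_world + T))  # True
```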
34:57.580 --> 35:01.040
Now, these parameters are called extrinsic parameters because they
35:01.040 --> 35:04.200
only describe the position and orientation of the camera.
35:04.600 --> 35:08.960
So, if I take a camera and I move it around, then the intrinsic
35:08.960 --> 35:12.200
parameters are not changing, but the extrinsic parameters are
35:12.200 --> 35:12.580
changing.
35:13.060 --> 35:20.700
If I keep the camera on its position and I change somehow the zoom in
35:20.700 --> 35:26.860
or zoom out, then the intrinsics are changing mainly and the extrinsic
35:26.860 --> 35:29.880
parameters are changing slightly.
35:31.260 --> 35:35.040
In an ideal case, they are not changing at all, but in practice they
35:35.040 --> 35:35.900
are changing slightly.
35:36.340 --> 35:38.720
So now, taking both steps together.
35:39.180 --> 35:43.820
So, the first step that does the first transformation from the world
35:43.820 --> 35:48.260
coordinates to the camera coordinate system and the second step that
35:48.260 --> 35:52.340
starts from the camera coordinates and ends up in the image
35:52.340 --> 35:56.180
coordinates, we get this concatenation of these two operations.
35:57.420 --> 36:01.500
And you can see that here we have the extrinsic parameters R and T and
36:01.500 --> 36:04.140
we have the intrinsic parameters A.
36:04.700 --> 36:08.220
And that means once we know all these parameters, we can describe in
36:08.220 --> 36:11.600
which way points in the three-dimensional world are mapped into the
36:11.600 --> 36:11.940
image.
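Concatenating both steps gives z (u, v, 1)^T = A [R | T] (xi, eta, zeta, 1)^T; a minimal sketch with assumed intrinsic and extrinsic values:

```python
import numpy as np

# Illustrative intrinsics A and extrinsics R, T (all values are assumptions).
A = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                        # camera aligned with the world axes
T = np.array([0.0, 0.0, 2.0])        # world origin 2 units in front of camera

P = A @ np.hstack([R, T[:, None]])   # full 3x4 projection matrix A [R | T]

p_world = np.array([0.2, 0.1, 3.0, 1.0])  # homogeneous world point
uvw = P @ p_world                    # = z * (u, v, 1)
u, v = uvw[:2] / uvw[2]
print(round(u, 1), round(v, 1))      # 352.0 256.0
```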
36:12.660 --> 36:18.500
So, of course, sometimes you might wonder from where do we get these
36:18.500 --> 36:19.080
parameters.
36:19.080 --> 36:23.420
So, this matrix R T is a 3 by 4 matrix.
36:24.460 --> 36:26.900
So, it's by definition not invertible.
36:28.980 --> 36:34.420
But, of course, the rotation matrix as such is invertible and very
36:34.420 --> 36:35.380
easily invertible.
36:35.500 --> 36:37.140
You just have to take the transpose of it.
36:37.760 --> 36:42.240
And, of course, the operation to shift a point by an offset of T is
36:42.240 --> 36:43.680
also an invertible operation.
36:44.560 --> 36:48.000
So, that means the operation that is implemented by this matrix is
36:48.000 --> 36:50.660
invertible, fully invertible, always invertible.
36:51.180 --> 36:54.340
But you cannot invert this matrix.
36:54.580 --> 36:57.600
But you can split it up into two steps and then invert each of the
36:57.600 --> 36:57.880
steps.
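A sketch of this two-step inversion, with an assumed R and T: the 3x4 matrix itself has no inverse, but undoing the shift and then the rotation recovers the world point.

```python
import numpy as np

# Any rotation matrix works; this one is an assumed example.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
T = np.array([1.0, 2.0, 3.0])

p_world = np.array([4.0, 5.0, 6.0])
p_cam = R @ p_world + T              # forward: world -> camera

# Inverse in two steps: undo the shift, then undo the rotation.
# R is orthogonal, so its inverse is just its transpose.
p_back = R.T @ (p_cam - T)
print(np.allclose(p_back, p_world))  # True
```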
36:58.480 --> 36:58.600
Yeah.
36:58.900 --> 37:05.820
So, maybe you might ask from where do I get these parameters, the
37:05.820 --> 37:07.660
intrinsic and extrinsic parameters.
37:08.260 --> 37:14.080
For the intrinsic parameters, maybe the camera provider is providing
37:14.080 --> 37:19.120
to you some information about the focal length of the lens, about the
37:19.120 --> 37:23.160
resolution of the imager, things like that.
37:23.340 --> 37:28.420
However, these are not sufficiently accurate for doing all our
37:28.420 --> 37:29.080
calculations.
37:29.740 --> 37:34.100
And, of course, for the extrinsic parameters, you might start with a
37:34.100 --> 37:39.280
laser measurement instrument or something like that to measure the
37:39.280 --> 37:39.580
offset.
37:39.920 --> 37:41.600
But this is not accurate enough.
37:41.940 --> 37:45.440
So, what we need to determine these parameters is a so-called
37:45.440 --> 37:46.740
calibration process.
37:47.800 --> 37:53.920
This calibration process typically consists of a scene, an
37:53.920 --> 37:56.940
artificial scene that we show to the camera.
37:58.320 --> 38:02.540
This could be, so this is a little bit an old picture, more than 10
38:02.540 --> 38:07.600
years ago, where we put some chessboard markers on the ground and then
38:07.600 --> 38:10.040
measured out this scene exactly.
38:10.980 --> 38:15.360
Or, the more modern approach is to use these chessboards and show
38:15.360 --> 38:19.040
these chessboards to the camera from different orientations and at
38:19.040 --> 38:19.820
different distances.
38:20.700 --> 38:25.340
And then there are some algorithms that are taking these images and
38:25.340 --> 38:28.020
calculating all these parameters that we need.
38:28.900 --> 38:33.560
So, without giving you the details, there are algorithms from which we
38:33.560 --> 38:35.300
can determine these parameters.
38:35.500 --> 38:39.400
So, for all things, for all tasks that we are discussing in this
38:39.400 --> 38:44.140
lecture of automotive vision, we might assume that we know the
38:44.140 --> 38:49.440
intrinsic and extrinsic parameters of the camera and that we obtain
38:49.440 --> 38:51.660
them from some calibration process before.
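The lecture leaves out the details of these calibration algorithms. As a rough, hedged illustration of the underlying idea only (not necessarily the algorithm used in practice), the Direct Linear Transform recovers the 3x4 projection matrix from known 3D-2D correspondences; the data here is synthetic and noise-free:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth projection matrix A [R | T] (a synthetic stand-in, not a real camera).
A = np.array([[700.0, 0.0, 300.0], [0.0, 700.0, 200.0], [0.0, 0.0, 1.0]])
RT = np.hstack([np.eye(3), np.array([[0.1], [0.2], [2.0]])])
P_true = A @ RT

# Known 3D points (like measured chessboard corners) and their observed pixels.
X = rng.uniform(-1, 1, (10, 3)) + np.array([0, 0, 4])   # points in front of camera
Xh = np.hstack([X, np.ones((10, 1))])
uvw = Xh @ P_true.T
uv = uvw[:, :2] / uvw[:, 2:]

# Direct Linear Transform: each correspondence gives two linear equations
# in the 12 entries of P; solve the homogeneous system with an SVD.
rows = []
for (x, y, z, w), (u, v) in zip(Xh, uv):
    rows.append([x, y, z, w, 0, 0, 0, 0, -u*x, -u*y, -u*z, -u*w])
    rows.append([0, 0, 0, 0, x, y, z, w, -v*x, -v*y, -v*z, -v*w])
_, _, Vt = np.linalg.svd(np.array(rows))
P_est = Vt[-1].reshape(3, 4)

# P is only defined up to scale; normalize both and compare.
P_est /= P_est[-1, -1]
P_ref = P_true / P_true[-1, -1]
print(np.allclose(P_est, P_ref, atol=1e-6))  # True
```

Real calibration (chessboards shown at different orientations and distances) additionally separates P into A, R, and T and handles noise and lens distortion, which this sketch ignores.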
38:53.400 --> 38:58.500
Okay, that's our brief introduction into the basic things from
38:58.500 --> 38:59.900
computer vision that we need.
39:00.700 --> 39:06.780
Yeah, I guess that with this knowledge you should be able to solve all
39:06.780 --> 39:10.580
the things and understand all things that we discuss here in the
39:10.580 --> 39:10.860
lecture.
39:11.440 --> 39:15.680
Of course, if you want to know more about these things, consider
39:15.680 --> 39:16.360
literature.
39:16.660 --> 39:19.660
They are really basic things that you find in all textbooks on
39:19.660 --> 39:20.280
computer vision.
39:20.860 --> 39:25.620
Or, check out the slides from the machine vision lecture from winter
39:25.620 --> 39:31.680
term and learn more details from there.
39:32.220 --> 39:34.800
That's the point where we can start with the second chapter.
39:37.960 --> 39:40.400
Second lecture on binocular vision.
39:41.420 --> 39:50.560
So, as we just heard with a single camera, we can map three
39:50.560 --> 39:55.720
-dimensional points in the world onto a two-dimensional image.
39:55.720 --> 40:00.120
And obviously, we lose one dimension by doing that.
40:01.200 --> 40:03.620
The inverse process is not that easy.
40:04.600 --> 40:09.800
When we start from a pixel in the image, we are not able to determine
40:09.800 --> 40:13.900
the position of the three-dimensional point that we just observed.
40:14.400 --> 40:16.880
We are able to determine a line of sight.
40:17.320 --> 40:21.120
We are able to determine from which direction the line of sight
40:21.120 --> 40:22.400
entered the camera.
40:23.140 --> 40:27.180
But we do not know how far away the point is that we have seen.
40:28.320 --> 40:34.780
And to overcome this problem, one technique is binocular vision or
40:34.780 --> 40:35.580
stereo vision.
40:35.800 --> 40:41.320
The idea is to use two cameras instead of just one and use the
40:41.320 --> 40:44.800
information that we get from these two cameras in order to reconstruct
40:45.920 --> 40:48.960
the 3D points accurately.
40:50.140 --> 40:53.400
Okay, that is what we will discuss here in this chapter.
40:54.460 --> 40:55.880
Yeah, some references.
40:56.500 --> 41:01.000
So, binocular vision is something that is a topic that is not
41:01.000 --> 41:02.260
completely new.
41:02.700 --> 41:08.800
Therefore, you find information on binocular vision in most textbooks
41:08.800 --> 41:09.960
on computer vision.
41:10.400 --> 41:12.760
So, just some of them are shown here.
41:12.880 --> 41:17.980
So, the book of Hartley and Zisserman, Multiple View Geometry, mainly
41:17.980 --> 41:24.280
focuses on this geometrical reasoning between points in the world and
41:24.280 --> 41:25.740
points in the image.
41:26.880 --> 41:31.020
And of course, it also discusses the topic of binocular vision.
41:31.580 --> 41:35.700
Then also in other textbooks like the Davies or Forsyth textbook, you
41:35.700 --> 41:38.380
find chapters on binocular vision.
41:38.540 --> 41:40.720
Of course, in many more textbooks as well.
41:41.940 --> 41:43.280
Okay, let's start.
41:43.600 --> 41:47.260
And we start with a look at the geometry.
41:47.840 --> 41:52.060
Yeah, we just discussed the pinhole camera model and we want to start
41:52.060 --> 41:56.120
from the pinhole camera model to develop this idea.
41:56.920 --> 42:01.660
So, our basic thing, our basic idea in binocular vision is the
42:01.660 --> 42:01.960
following.
42:02.740 --> 42:05.820
So, let's assume we have a three-dimensional scene.
42:06.040 --> 42:12.860
Some person is kicking a soccer ball somewhere in front of a building.
42:14.380 --> 42:18.140
If we have one camera, say this one here, which is named the left
42:18.140 --> 42:20.840
camera, we get one image.
42:22.120 --> 42:26.500
And from this image, we can reconstruct, if we focus on this ball, we
42:26.500 --> 42:30.360
can reconstruct the line of sight of the ball using the pinhole camera
42:30.360 --> 42:30.740
model.
42:31.300 --> 42:35.800
Now, we can construct one line and we know the point that we observed
42:35.800 --> 42:41.940
must be located somewhere on this line, at some point on this line.
42:43.700 --> 42:46.880
Okay, but we do not know how far away the ball is.
42:47.260 --> 42:54.540
Now, if we add a second camera, say, name it the right camera, and we
42:54.540 --> 42:59.280
make a picture of the same scene, shown here, of course the picture
42:59.280 --> 43:03.580
will look slightly different from the first picture of the other
43:03.580 --> 43:08.560
camera, because it's made from a little different point of view.
43:09.660 --> 43:14.300
Again, we can extract the point here, the position of the ball in the
43:14.300 --> 43:18.120
image, and based on that we can determine a line of sight, this one
43:18.120 --> 43:18.360
here.
43:18.980 --> 43:24.220
And now, hopefully, both lines of sight intersect at a certain point,
43:24.220 --> 43:27.300
and then we can conclude that this point of intersection, where these
43:27.300 --> 43:33.160
two lines of sight intersect, is the true three-dimensional point at
43:33.160 --> 43:35.800
which this object is located in the world.
43:36.340 --> 43:40.000
Yeah, that's the basic idea that is shown here.
43:41.360 --> 43:45.820
Now, we can see that we have a triangle here, so these three points
43:45.820 --> 43:49.700
establish a triangle, and then we have to do some geometrical
43:49.700 --> 43:53.840
reasoning over triangles to determine this position in the three
43:53.840 --> 43:54.660
-dimensional world.
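In practice the two lines of sight rarely intersect exactly, so a common sketch of this geometrical reasoning is the midpoint method, which returns the point closest to both rays; the focal points and the observed point below are assumed example values:

```python
import numpy as np

def triangulate(f_left, d_left, f_right, d_right):
    """Closest point to two lines of sight f + t*d (least-squares 'midpoint'
    method; with noisy measurements the rays don't intersect exactly)."""
    # Solve for t_l, t_r minimizing |(f_l + t_l d_l) - (f_r + t_r d_r)|^2.
    M = np.column_stack([d_left, -d_right])
    t = np.linalg.lstsq(M, f_right - f_left, rcond=None)[0]
    p_l = f_left + t[0] * d_left
    p_r = f_right + t[1] * d_right
    return (p_l + p_r) / 2           # midpoint of the shortest connecting segment

# Two focal points 0.5 apart (an assumed baseline) and a point P at depth 5.
P = np.array([1.0, 1.0, 5.0])
f_l = np.array([0.0, 0.0, 0.0])
f_r = np.array([0.5, 0.0, 0.0])
print(np.allclose(triangulate(f_l, P - f_l, f_r, P - f_r), P))  # True
```

With exact directions the two rays meet in P, so the midpoint reproduces the true 3D point.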
43:56.040 --> 43:57.380
So, what did we do?
43:57.540 --> 44:02.520
So, we have a three-dimensional world, R3, so all points in the world
44:02.520 --> 44:07.800
can be represented with a three-dimensional vector, and with the two
44:07.800 --> 44:13.940
cameras we map these to a two-dimensional position in the left image,
44:14.220 --> 44:17.180
and another two-dimensional position in the right image.
44:17.180 --> 44:25.720
So, what we do is, we have a mapping from R3 to R2 times R2.
44:26.060 --> 44:35.280
So, in total, we have a mapping from R3 to R4, if we consider the
44:35.280 --> 44:40.380
concatenation, or the combination, better to say, of the position of
44:40.380 --> 44:44.260
the respective point in the left camera image and the right camera
44:44.260 --> 44:44.560
image.
44:44.560 --> 44:50.500
Then we have four coordinates, U left, V left, U right, V right, which
44:50.500 --> 44:52.580
establish a four-dimensional space.
44:53.400 --> 44:56.600
So, a mapping from a three-dimensional space to a four-dimensional
44:56.600 --> 45:04.480
space, of course, cannot fill the whole four-dimensional space.
45:04.860 --> 45:08.940
But the image, the area actually where points are mapped to in this
45:08.940 --> 45:13.300
four-dimensional space, is a subspace, a three-dimensional subspace
45:13.300 --> 45:15.320
of this four-dimensional space.
45:16.240 --> 45:20.080
Let's have a look at how this subspace looks and how we can
45:20.080 --> 45:20.840
describe it.
45:22.280 --> 45:25.940
Once we have that, then we can reconstruct the scene like that.
45:27.100 --> 45:31.000
For that purpose, we again start with a pinhole camera model.
45:31.840 --> 45:33.600
We simplify the whole thing.
45:33.920 --> 45:41.480
P is the point in the world that we want to observe, capital P. Then
45:41.480 --> 45:45.560
FL should be the focal point of the left camera.
45:46.660 --> 45:51.240
I leave out the coordinate system here, but you might assume a camera
45:51.240 --> 45:55.920
coordinate system that has its origin here and some z coordinate in
45:55.920 --> 45:59.040
some direction, x and y in the other direction.
45:59.880 --> 46:03.840
So, we just learned that the image plane in a real camera is behind
46:03.840 --> 46:04.800
the focal point.
46:05.420 --> 46:07.640
And that would be the correct drawing.
46:08.240 --> 46:17.560
However, in this case, for some mathematical reasons, it is simpler to
46:17.560 --> 46:22.060
assume that the image plane is in front of the camera, of the focal
46:22.060 --> 46:28.700
point, at a distance of one, so that the distance between the focal
46:28.700 --> 46:32.460
point and the image plane is one.
46:33.120 --> 46:37.360
So, we actually, if we follow this pinhole camera model, we can do it
46:37.360 --> 46:37.900
like that.
46:38.140 --> 46:43.540
The only difference between the two ideas is that the images are
46:43.540 --> 46:46.800
mirrored, are reflected at the principal point.
46:47.720 --> 46:48.900
That's the only difference.
46:49.000 --> 46:54.760
But we can do the same reasoning as we did in chapter one, with a
46:54.760 --> 46:59.700
pinhole camera as well, for an image plane that is in front of the
46:59.700 --> 47:01.740
focal point instead of behind it.
47:02.800 --> 47:05.780
But that makes reasoning a little bit simpler.
47:07.260 --> 47:11.580
Okay, so that's the image plane, our virtual left image plane in front
47:11.580 --> 47:12.480
of the focal point.
47:13.060 --> 47:17.180
The yellow line is a line of sight, and that means the point of
47:17.180 --> 47:21.520
intersection of this line of sight with a virtual image plane yields a
47:21.520 --> 47:29.280
point, say QL, and that's the image point at which we see this point,
47:29.420 --> 47:30.980
capital P, in the image.
47:31.300 --> 47:37.340
Okay, and QL is a coordinate in the camera coordinate system of the
47:37.340 --> 47:38.120
left camera.
47:40.160 --> 47:44.700
Okay, now we have a second camera, say the right camera, and we have
47:44.700 --> 47:45.560
the same story.
47:45.860 --> 47:51.400
The camera is described by the focal point, FR, and a virtual image
47:51.400 --> 47:56.800
plane that we put in front of the focal point of the camera at a
47:56.800 --> 47:57.940
distance of one.
47:58.840 --> 48:03.260
And again, this line of sight, yellow, intersects with this virtual
48:03.260 --> 48:07.420
image plane and yields the coordinate of the image point QR.
48:09.320 --> 48:13.880
So, we add another line here, namely the line that connects the two
48:13.880 --> 48:18.280
focal points, the focal point of the left and the right camera.
48:18.980 --> 48:22.680
And this is often called the baseline of the binocular camera system,
48:22.900 --> 48:28.420
and it might also intersect with this image planes in one point with
48:28.420 --> 48:30.100
the left virtual image plane.
48:30.280 --> 48:32.500
Let's call this point of intersection EL.
48:33.260 --> 48:40.920
That has a name: it is called the epipolar point, the epipole in the left
48:40.920 --> 48:45.860
image, and the other point of intersection, let's name it ER, as the
48:45.860 --> 48:47.200
right epipole.
48:48.560 --> 48:55.100
Okay, now what we can further do is we can calculate the line at which
48:55.640 --> 49:01.080
the image plane, the virtual image plane, intersects with a plane that
49:01.080 --> 49:04.780
is established by these three points P, FL, and FR.
49:05.660 --> 49:10.620
Yes, three points which are not collinear in the three-dimensional
49:10.620 --> 49:14.820
world establish or define a plane.
49:15.880 --> 49:21.260
Yeah, this plane has a name, it's called epipolar plane, and this
49:21.260 --> 49:26.880
plane intersects with the two image planes in a certain line, and this
49:26.880 --> 49:29.220
line is called the epipolar line.
49:29.580 --> 49:31.880
And it's shown here in dashed blue.
49:32.100 --> 49:33.360
That's the epipolar line.
49:33.700 --> 49:39.620
And of course, QL and EL are part of this line in the left image, and
49:39.620 --> 49:48.160
ER and QR are points in the right image plane, which are on this line.
49:50.000 --> 49:55.240
Okay, and now we can conclude
49:55.240 --> 50:00.480
that all points which are part of this epipolar plane, yeah, this
50:00.480 --> 50:05.720
plane that is established by this triangle, and which are somehow in
50:05.720 --> 50:14.920
front of the left camera, are mapped to points on the epipolar
50:14.920 --> 50:17.000
line in the left image, the dashed one here.
50:17.380 --> 50:22.600
And all points which are part of the epipolar plane, and which are in
50:22.600 --> 50:27.060
front of the right camera, are mapped to points on the epipolar line
50:27.060 --> 50:29.120
in the right camera image.
50:29.420 --> 50:33.760
So now, we start with a little bit of mathematics and formula.
50:35.280 --> 50:39.440
The first thing is, we just described the relationship between the
50:39.440 --> 50:45.420
position QL and the position of this point P. And since we are faced
50:45.420 --> 50:50.900
with different coordinate systems, I want to use upper indices to
50:50.900 --> 50:56.000
refer to the coordinate system in which a vector is represented.
50:56.480 --> 51:02.180
So there might be the upper index capital L, that describes that this
51:02.180 --> 51:06.360
vector is represented in the camera coordinate system of the left
51:06.360 --> 51:06.740
camera.
51:07.380 --> 51:13.960
And there is the upper index capital R, that describes that the vector
51:13.960 --> 51:17.100
is represented in the coordinate system of the right camera.
51:18.880 --> 51:24.820
So let's start with this formula here, considering the point QL
51:24.820 --> 51:28.860
represented in the camera coordinate system of the left camera.
51:29.960 --> 51:37.200
So this point QL is created from P. So P is the vector that describes
51:37.200 --> 51:42.080
the position of this three-dimensional point P that we are interested
51:42.080 --> 51:42.420
in.
51:42.420 --> 51:48.420
So it is located in the same direction as the point P, seen from the
51:48.420 --> 51:49.440
focal point FL.
51:50.300 --> 51:54.880
But of course, the distance from the camera is not the same.
51:55.720 --> 52:00.900
So how can we get this position QL, upper index L?
52:01.320 --> 52:09.840
Well, we take the vector PL, upper index L, and somehow scale this
52:09.840 --> 52:14.340
vector with 1 over ZLL.
52:14.740 --> 52:21.060
So Z is actually the distance of this point P from the focal point F
52:21.060 --> 52:24.680
along the optical axis of the left camera.
52:25.220 --> 52:28.580
So it's actually a third entry of this vector PL.
52:30.340 --> 52:34.620
So what we do is, well, we divide this vector by ZLL.
52:35.160 --> 52:39.920
That means we project it onto a plane that has a distance of 1 from
52:39.920 --> 52:41.440
the focal point of the camera.
52:42.620 --> 52:44.880
And then we multiply it by 1.
52:45.840 --> 52:49.640
So QLL can be calculated in this way from PLL.
52:50.640 --> 52:55.760
In an analogous way, we can do the same calculation to calculate QR.
52:56.940 --> 53:04.560
Now we work in the camera coordinate system of the right camera, and
53:04.560 --> 53:10.900
we argue that QR, upper index R, is equal to PR, upper index R,
53:11.680 --> 53:15.300
rescaled by a factor of 1 over ZRR.
53:17.580 --> 53:18.220
Okay?
53:19.300 --> 53:25.180
So we go into the same direction, but not as far as to reach the
53:25.180 --> 53:29.400
point P, but only far enough to end up at the image plane.
53:29.880 --> 53:31.680
And the same here at this side.
53:32.100 --> 53:43.380
So, now, we know that, yeah, we know that these three points, FL, FR,
53:43.540 --> 53:46.100
and P, create a plane.
53:47.940 --> 53:50.900
They are elements of a plane.
53:51.880 --> 53:57.860
And that means that also this vector B that connects FL and FR, then
53:57.860 --> 54:03.600
this vector PL that connects FL and P, and the vector PR that connects
54:03.600 --> 54:08.520
FR and P, that they are elements of this plane.
54:08.980 --> 54:13.560
And we can use that by saying, okay, if we take the cross product or
54:13.560 --> 54:19.900
outer product of B and PL, so we take this vector and we take this
54:19.900 --> 54:21.860
vector and calculate the outer product.
54:24.180 --> 54:26.500
Do you remember what the outer product was?
54:26.680 --> 54:27.980
Cross product, outer product?
54:30.060 --> 54:31.420
You remember it?
54:32.100 --> 54:32.520
Yes?
54:33.800 --> 54:34.480
Yes?
54:34.840 --> 54:35.780
I remember it.
54:35.820 --> 54:37.100
Who else remembers it?
54:37.960 --> 54:38.440
No?
54:38.700 --> 54:40.000
Who doesn't remember it?
54:41.200 --> 54:41.900
One person?
54:42.240 --> 54:42.880
Two persons?
54:43.640 --> 54:44.200
Okay.
54:44.420 --> 54:49.340
So, for those two persons: take a math textbook and read it up.
54:49.820 --> 54:51.540
I won't explain it in detail.
54:51.960 --> 54:55.520
But what the outer product is, it takes these two vectors and
54:55.520 --> 54:58.900
calculates a vector that is orthogonal to both of them.
54:59.940 --> 55:00.080
Yeah?
55:00.560 --> 55:03.980
And the length of this vector is somehow related to the length of B
55:03.980 --> 55:06.620
and PL and the angle between both.
55:07.420 --> 55:11.880
But the main important thing is that the cross product of B and PL is
55:11.880 --> 55:14.340
a vector that is orthogonal to these two vectors.
55:14.620 --> 55:18.100
And that means it's a vector that is orthogonal to this plane here.
55:18.760 --> 55:21.760
Now, if we take this vector that is orthogonal to this plane and
55:21.760 --> 55:26.120
multiply it with a vector which is part of the plane, multiply in the
55:26.120 --> 55:29.940
sense of the inner product, the dot product, scalar product, inner
55:29.940 --> 55:33.940
product, yeah, like here, then what we get is zero because these two
55:33.940 --> 55:36.540
vectors are orthogonal to each other.
55:37.240 --> 55:44.800
That means, if we take this cross product and multiply this PR, then
55:44.800 --> 55:45.360
we get zero.
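This coplanarity argument can be checked numerically; all vectors below are assumed example values, represented in the left camera coordinate system as in the derivation:

```python
import numpy as np

# All vectors in the left camera coordinate system (assumed example values).
b = np.array([0.5, 0.0, 0.0])        # baseline: focal point F_L to F_R
P = np.array([1.0, 1.0, 5.0])        # the observed 3D point

p_L = P                              # vector F_L -> P  (F_L is the origin here)
p_R = P - b                          # vector F_R -> P, still in left coordinates

# b x p_L is orthogonal to the epipolar plane spanned by b and p_L;
# p_R lies in that plane, so the dot product must vanish.
n = np.cross(b, p_L)
print(np.isclose(n @ p_R, 0.0))      # True
```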
55:45.600 --> 55:49.100
Of course, we have to consider, we have to be sure that we do all this
55:49.100 --> 55:53.740
calculation in representations of the same coordinate system.
55:54.720 --> 55:54.800
Yeah?
55:55.120 --> 56:01.000
And therefore, you see that here, the upper index here that indicates
56:01.000 --> 56:03.060
the coordinate system is always L.
56:03.580 --> 56:06.260
I could also choose R if I would like.
56:06.420 --> 56:08.500
That would be the same result.
56:08.860 --> 56:10.040
But here I choose L.
56:10.180 --> 56:13.600
So, now we have this strange situation that we represent this point
56:13.600 --> 56:19.360
PR, this vector PR, yeah, this describes the position of the point P
56:19.360 --> 56:27.080
with respect to the right camera in the coordinate system of the left
56:27.080 --> 56:27.560
camera.
56:27.820 --> 56:27.820
Yeah?
56:29.260 --> 56:32.340
A little bit strange at the moment, but necessary.
56:32.580 --> 56:35.860
Otherwise, we couldn't do this calculation at that point.
56:36.420 --> 56:37.420
So, you had a question?
56:38.880 --> 56:39.960
Exactly that, okay.
56:40.320 --> 56:40.320
Yeah.
56:41.380 --> 56:41.700
Okay.
56:41.900 --> 56:42.960
Now, okay.
56:43.840 --> 56:45.120
Now, let's go on.
56:45.260 --> 56:46.500
Start with this equation.
56:47.000 --> 56:48.420
So, this must always hold.
56:49.680 --> 56:51.100
Start with this equation.
56:51.870 --> 56:57.100
And, of course, what we want to do is we want to get rid of this PRL
56:57.100 --> 57:01.260
and transform it into PRR somehow.
57:01.620 --> 57:01.620
Yeah?
57:02.660 --> 57:04.460
So, and this is done here.
57:04.560 --> 57:05.300
This is shown here.
57:05.560 --> 57:11.380
So, we need to transform a vector represented in the camera coordinate
57:11.380 --> 57:13.640
system of the left-hand coordinate.
57:14.240 --> 57:20.680
Sorry, we have to do a transformation between the coordinates of the
57:20.680 --> 57:23.580
right camera coordinate system and the left coordinate system.
57:24.060 --> 57:26.760
So, and this is shown here.
57:26.860 --> 57:31.060
So, we know both coordinate systems are orthogonal right-handed
57:31.060 --> 57:32.200
coordinate systems.
57:32.400 --> 57:33.880
They use the same length unit.
57:34.300 --> 57:37.000
So, if one is calculating in meters, the other as well.
57:38.460 --> 57:42.560
Okay, that means the transformation between these two coordinate
57:42.560 --> 57:48.320
systems requires an offset, a shift, yeah, which shifts the origin of
57:48.320 --> 57:50.720
one coordinate system to the origin of the other coordinate system.
57:51.440 --> 57:57.280
And it requires a rotation matrix in order to align the orientation of
57:57.280 --> 57:58.460
the two coordinate systems.
57:59.320 --> 58:04.680
So, that means, in general, there will be, of course, a vector, b,
58:05.020 --> 58:10.860
that's exactly this vector, the green vector, b, that shifts the
58:10.860 --> 58:15.120
origins of the two coordinate systems into each other.
58:16.080 --> 58:19.520
And there will be a rotation matrix, which I have denoted as D.
58:21.160 --> 58:25.380
So, D should be a rotation matrix, not a diagonal matrix, but a
58:25.520 --> 58:31.940
rotation matrix, from the German word Drehmatrix, yeah, and therefore D.
58:33.900 --> 58:40.500
And with these, I can take a vector, represent a point that is
58:40.500 --> 58:45.300
represented in the coordinates of the right-hand, of the right camera
58:45.300 --> 58:50.240
coordinate system, and transform it into the coordinates of the left
58:50.240 --> 58:54.100
-hand coordinate system, yeah, of the left camera coordinate system,
58:54.160 --> 58:54.360
sorry.
58:55.700 --> 58:57.580
So, now let's start with that.
58:58.140 --> 59:01.000
So, what we have is P, R, L.
59:03.460 --> 59:07.880
That's the vector, P, R, L, that's a vector from here to here, from
59:07.880 --> 59:13.540
the point F, R to the point P. So, that's the same as, so going from
59:13.540 --> 59:17.860
here to here is the same as going from F, R to F, L, and then going
59:17.860 --> 59:26.020
from F, L to P. That means going from F, R to F, L is the same as
59:26.020 --> 59:31.720
minus the vector from F, L to F, R, so in total we have minus the vector
59:31.720 --> 59:39.160
from F, L to F, R, plus the vector from F, L to P, represented in the left camera coordinate system.
59:40.540 --> 59:44.420
So, that's always a vector from a certain point to another point,
59:44.560 --> 59:45.780
yeah, that's this notation.
59:46.500 --> 59:56.820
So, now we have this vector
59:56.820 --> 01:00:00.100
represented in the left coordinate system, so this matches the left-hand
01:00:00.100 --> 01:00:04.300
side of this equation, so we can substitute it by the right-hand
01:00:04.300 --> 01:00:05.280
side of this equation.
01:00:05.280 --> 01:00:11.500
That means, instead of this one, we write this is equal to this
01:00:11.500 --> 01:00:16.120
rotation matrix D times the same vector, but now represented in the
01:00:16.120 --> 01:00:22.300
coordinates of the right-hand side coordinate system, plus B, L, that
01:00:22.300 --> 01:00:27.080
comes from here, minus, well, we do the same story with this vector
01:00:27.080 --> 01:00:31.040
here, so we transform it into coordinates of the right-hand side
01:00:31.040 --> 01:00:35.460
coordinate system, so this is D times F, L, F, R, represented in the
01:00:35.460 --> 01:00:37.680
right camera coordinate system, plus B, L.
01:00:38.720 --> 01:00:43.460
So, now we see B, L occurs here, and minus B, L occurs here, so this
01:00:43.460 --> 01:00:48.260
yields zero, and then we can factor out this rotation matrix
01:00:48.260 --> 01:00:54.780
D, and then we get this term here, and now we see in brackets what we
01:00:54.780 --> 01:01:00.740
do we have, well, we might start at F, R and then go to F, L, so this
01:01:00.740 --> 01:01:04.420
is minus the vector from F, L to F, R, that's equal to the vector that
01:01:04.420 --> 01:01:10.520
starts at F, R and goes to F, L, and then we have the vector from F, L
01:01:10.520 --> 01:01:14.300
to P, so from here to here, so in total we go from here to here to
01:01:14.300 --> 01:01:19.780
here, so, or, and finally, or as the result of this addition, we go
01:01:19.780 --> 01:01:23.440
from F, R to P, so this vector here is nothing else than the vector
01:01:23.440 --> 01:01:28.640
from here to here, the vector F, R to P, represented in the right
01:01:28.640 --> 01:01:32.440
coordinate system, and this is actually nothing else than this vector,
01:01:32.660 --> 01:01:37.300
small p, R, now represented in the right-hand side coordinate system.
01:01:37.640 --> 01:01:42.580
That means, if we want to get rid of this strange vector here,
01:01:42.620 --> 01:01:46.440
represented in the wrong coordinate system, so to say, we just have to
01:01:46.440 --> 01:01:51.080
substitute it with this
01:01:51.080 --> 01:01:54.260
rotation matrix times the P, R, R vector.
01:01:54.880 --> 01:01:55.980
Okay, let's do that.
01:01:56.700 --> 01:02:00.360
So, we start again with this equation that we derived on the slide
01:02:00.360 --> 01:02:06.300
before, now we substitute this term here, P, R, L by D times P, R, R,
01:02:06.480 --> 01:02:13.440
and then we end up with this term here, and yeah, now we can go on.
01:02:14.020 --> 01:02:18.880
So, we have already derived this relationship between Q, L, and P, L,
01:02:18.980 --> 01:02:25.540
and between Q, R, and P, R, now we can substitute P, R, R at this
01:02:25.540 --> 01:02:33.580
point by actually Z, R, R times Q, R, R, and this point P, L, L can be
01:02:33.580 --> 01:02:34.420
substituted.
01:02:35.120 --> 01:02:39.500
So, if we resolve this equation with respect to P, L, L, it can be
01:02:39.500 --> 01:02:42.560
substituted by Z, L, L times Q, L, L.
01:02:43.740 --> 01:02:48.940
So, then we get this thing here, this equation here, now we see the
01:02:48.940 --> 01:02:54.260
left -hand side should be equal to zero, and both Z, L, L and Z, R, R
01:02:54.260 --> 01:02:56.800
are numbers which are unequal to zero.
01:02:56.960 --> 01:03:00.680
If they were equal to zero, we would not be able to see them,
01:03:01.260 --> 01:03:01.680
actually.
01:03:02.000 --> 01:03:07.840
So, only points for which these Z values are positive can be seen
01:03:07.840 --> 01:03:08.900
by a camera.
01:03:09.540 --> 01:03:12.380
So, these are two positive factors, so we can divide the whole
01:03:12.380 --> 01:03:17.720
equation by these two factors, and what remains after transforming
01:03:17.720 --> 01:03:21.540
this thing looks like that equation.
01:03:21.680 --> 01:03:25.000
So, the Z's disappeared, that's nice, because we don't know them in
01:03:25.000 --> 01:03:29.860
advance, and now we have an equation where only this D, this rotation
01:03:29.860 --> 01:03:35.440
matrix, then this baseline vector B occurs, and two coordinates, two
01:03:35.440 --> 01:03:41.600
image point coordinates, the coordinate of a point, of an image point
01:03:41.600 --> 01:03:43.740
in the left image and in the right image.
01:03:44.740 --> 01:03:52.380
So, now a last step to simplify this notation, what we can see here is
01:03:52.380 --> 01:03:57.620
a cross product, a vector product, of B, L and Q, L, and this cross
01:03:57.620 --> 01:04:04.060
product can be reformulated into a matrix times vector multiplication,
01:04:05.980 --> 01:04:06.920
like that.
01:04:07.120 --> 01:04:12.640
So, this cross product of B, L and Q, L is the same as this matrix
01:04:12.640 --> 01:04:15.080
times Q, L, L.
01:04:15.400 --> 01:04:24.460
This matrix here contains the entries of the vector B, L, in this way.
01:04:24.640 --> 01:04:28.620
Well, if you don't trust me, try it out and you will find out the
01:04:28.620 --> 01:04:29.860
result is the same.
01:04:30.260 --> 01:04:32.420
Or you found a mistake on the slides.
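That claim is easy to try out numerically, for example like this (the vector values here are arbitrary):

```python
import numpy as np

def skew(b):
    """Matrix [b]_x such that skew(b) @ q equals the cross product b x q."""
    return np.array([[0.0,  -b[2],  b[1]],
                     [b[2],  0.0,  -b[0]],
                     [-b[1], b[0],  0.0]])

b = np.array([1.0, 2.0, 3.0])
q = np.array([-0.5, 4.0, 0.25])

print(np.cross(b, q))   # the cross product computed directly
print(skew(b) @ q)      # the same result as a matrix-vector product
```

Note that this matrix is skew-symmetric: its transpose equals its negative.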
01:04:35.220 --> 01:04:39.280
Okay, so that's the same, and if we denote this matrix with this
01:04:39.280 --> 01:04:50.720
strange notation, so brackets, B, L, sub cross, then we can write it
01:04:50.720 --> 01:04:51.360
like that.
01:04:51.980 --> 01:04:54.060
Yeah, this is this one here.
01:04:54.640 --> 01:05:05.120
And now, if we use this notation and denote the product of this
01:05:05.120 --> 01:05:10.200
transpose of the rotation matrix times this strange matrix B, L cross
01:05:10.200 --> 01:05:18.140
as E, then we can denote this equation here like this, Q, R, R
01:05:18.140 --> 01:05:22.340
transpose times E times Q, L, L must be equal to zero.
01:05:23.080 --> 01:05:27.760
E is a matrix called essential matrix, and this essential matrix
01:05:27.760 --> 01:05:32.760
actually only contains extrinsic parameters of the binocular camera
01:05:32.760 --> 01:05:33.220
system.
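The derivation can be summarized compactly; this is a reconstruction from the spoken description, so the notation on the slides may differ in detail:

```latex
% coordinate transform between the two cameras, as used above:
p_{P,L} = D\, p_{P,R} + b_L
\quad\Longrightarrow\quad
D\, p_{P,R} = p_{P,L} - b_L
% b_L \times p_{P,L} is orthogonal to both b_L and p_{P,L}, hence
(p_{P,L} - b_L)^{\top}\,(b_L \times p_{P,L}) = 0
\quad\Longrightarrow\quad
p_{P,R}^{\top}\, D^{\top} [b_L]_{\times}\, p_{P,L} = 0
% dividing by z_R\, z_L > 0 and writing E := D^{\top} [b_L]_{\times}:
q_{R}^{\top}\, E\, q_{L} = 0
```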
01:05:33.460 --> 01:05:39.160
The relative position and orientation of the right camera with respect
01:05:39.160 --> 01:05:40.000
to the left camera.
01:05:40.300 --> 01:05:42.400
That's what is contained in this matrix.
01:05:42.600 --> 01:05:45.480
So, extrinsic parameters of the camera system.
01:05:45.840 --> 01:05:51.160
We argue, okay, we can calibrate the camera system, then we get to
01:05:51.160 --> 01:05:55.040
know these parameters, and that means once we have done this
01:05:55.040 --> 01:05:58.480
calibration, we can determine this essential matrix E.
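As a small numerical sketch of this: the rotation angle, baseline, and point below are made-up example values, and the coordinate-transform convention p_L = D p_R + b_L follows the derivation above.

```python
import numpy as np

def skew(b):
    """Skew-symmetric matrix [b]_x with skew(b) @ v == np.cross(b, v)."""
    return np.array([[0.0,  -b[2],  b[1]],
                     [b[2],  0.0,  -b[0]],
                     [-b[1], b[0],  0.0]])

# hypothetical extrinsics: coordinates transform as p_L = D @ p_R + b_L
theta = 0.2
D = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0,           1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
b_L = np.array([0.3, 0.0, 0.05])

E = D.T @ skew(b_L)          # essential matrix, E = D^T [b_L]_x

# a scene point, given in left camera coordinates
P_L = np.array([0.4, -0.2, 5.0])
P_R = D.T @ (P_L - b_L)      # the same point in right camera coordinates

q_L = P_L / P_L[2]           # projections onto the virtual image planes (z = 1)
q_R = P_R / P_R[2]

residual = q_R @ E @ q_L     # epipolar constraint: should vanish
print(residual)
```

The residual is zero up to floating-point error, for any scene point visible in both cameras.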
01:05:59.220 --> 01:06:03.300
And now what we get is a relationship between points in the left image
01:06:03.300 --> 01:06:04.860
and points in the right image.
01:06:05.180 --> 01:06:11.800
And we know that for all pairs of image points which refer to the same
01:06:11.800 --> 01:06:15.920
object in the three-dimensional world, this equation must hold.
01:06:16.560 --> 01:06:19.260
This equation must hold.
01:06:20.480 --> 01:06:29.240
And that describes this subset in the four-dimensional space of UL,
01:06:29.360 --> 01:06:39.360
VL, UR, VR quadruples that can actually occur in such a
01:06:39.360 --> 01:06:40.620
binocular camera setup.
01:06:41.020 --> 01:06:45.840
Because these are only those positions, that subspace for which this
01:06:45.840 --> 01:06:46.680
equation holds.
01:06:46.940 --> 01:06:51.840
And if we find a pair of points, of image points, in the two images
01:06:51.840 --> 01:06:57.040
for which this equation doesn't hold, we know these two points are not
01:06:57.040 --> 01:07:00.400
referring to the same point in the three-dimensional world.
01:07:01.780 --> 01:07:07.040
So it is a condition that is, so to say, eliminating the fourth
01:07:07.040 --> 01:07:13.020
dimension in the space of UL, VL, UR, VR coordinates.
01:07:14.660 --> 01:07:22.360
Or in other words, instead of memorizing UL, VL, UR, VR, so the full
01:07:22.360 --> 01:07:28.720
position of the image points in the two images, we could just remove
01:07:28.720 --> 01:07:32.740
one of these values and calculate it from this equation, if you like.
01:07:34.740 --> 01:07:38.580
So, okay, so that's the epipolar geometry.
01:07:38.960 --> 01:07:41.080
E is the so-called essential matrix.
01:07:42.280 --> 01:07:49.100
And yeah, that's the basic idea of this geometry of such a binocular
01:07:49.100 --> 01:07:49.900
camera system.
01:07:50.740 --> 01:07:55.820
So that also means if we know the image position in one image, say we
01:07:55.820 --> 01:08:03.180
know QL, and we know this equation, then we can search for possible
01:08:03.180 --> 01:08:06.620
positions of the corresponding point in the right image.
01:08:06.920 --> 01:08:11.860
Because we know for the corresponding point Q, R, this equation must
01:08:11.860 --> 01:08:12.240
hold.
01:08:13.540 --> 01:08:20.280
That means this limits our search to the epipolar line in the
01:08:20.280 --> 01:08:20.860
right image.
01:08:21.280 --> 01:08:26.760
Now this equation tells us for this point QL, the epipolar line in the
01:08:26.760 --> 01:08:28.520
right image has this representation.
01:08:29.240 --> 01:08:34.220
And the corresponding point can only be located on this line and
01:08:34.220 --> 01:08:36.720
nowhere else in the right image.
01:08:37.320 --> 01:08:39.900
Of course, this also works in the other direction.
01:08:40.420 --> 01:08:45.080
So if I know a point in the right image, then I can calculate the
01:08:45.080 --> 01:08:48.240
epipolar line that refers to that point in the left image.
01:08:50.620 --> 01:08:54.480
Yeah, that's actually what we have seen.
01:08:54.580 --> 01:09:00.720
And of course, the epipoles, these EL and ER points, are elements of all
01:09:00.720 --> 01:09:01.700
epipolar lines.
01:09:01.920 --> 01:09:05.320
They are elements of all epipolar lines because it doesn't matter which
01:09:05.320 --> 01:09:08.300
point P we select here.
01:09:08.860 --> 01:09:14.360
This connection between FL and FR is always part of the epipolar
01:09:14.360 --> 01:09:15.000
plane.
01:09:15.680 --> 01:09:20.500
And therefore, the EL and ER are always elements of all epipolar
01:09:20.500 --> 01:09:21.000
lines.
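These two facts, the epipolar line for a given point and the epipoles lying on every such line, can be checked numerically; the extrinsics below are made-up example values, with the same convention p_L = D p_R + b_L as in the derivation.

```python
import numpy as np

def skew(b):
    """Skew-symmetric matrix [b]_x with skew(b) @ v == np.cross(b, v)."""
    return np.array([[0.0,  -b[2],  b[1]],
                     [b[2],  0.0,  -b[0]],
                     [-b[1], b[0],  0.0]])

# hypothetical extrinsics, coordinate transform p_L = D @ p_R + b_L
theta = 0.2
D = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0,           1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
b_L = np.array([0.3, 0.0, 0.05])
E = D.T @ skew(b_L)

# epipolar line in the right image plane for a left-image point q_L:
# every candidate q_R must satisfy q_R^T (E @ q_L) = 0
q_L = np.array([0.1, -0.05, 1.0])
l_R = E @ q_L

# epipoles: e_L is the image of the right focal point FR (left coords b_L),
# e_R is the image of the left focal point FL (right coords -D^T b_L,
# which is the same projective point as D^T b_L)
e_L = b_L / b_L[2]
e_R = (D.T @ b_L) / (D.T @ b_L)[2]

print(e_R @ l_R)    # e_R lies on the epipolar line of any q_L
print(E @ e_L)      # and e_L is mapped to the zero vector by E
```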
01:09:22.340 --> 01:09:28.600
Okay, so why is it so important to know about this apipolar geometry?
01:09:28.600 --> 01:09:34.560
Well, as I said, so typically we have a look at one image, QL, and we
01:09:34.560 --> 01:09:38.320
determine a point of interest for which we want to determine the three
01:09:38.320 --> 01:09:39.240
-dimensional position.
01:09:39.860 --> 01:09:43.660
And then we need to find the same, the image of the same point in the
01:09:43.660 --> 01:09:45.500
right image.
01:09:46.720 --> 01:09:52.240
And yeah, if we know the epipolar line for that point, then we know we
01:09:52.240 --> 01:09:56.120
only need to search on that line for a corresponding point, nowhere
01:09:56.120 --> 01:09:56.680
else.
01:09:57.120 --> 01:09:58.920
That is important.
01:09:59.700 --> 01:10:03.940
So this restricts the search area for corresponding points.
01:10:05.980 --> 01:10:10.840
Yeah, furthermore, we know that one pair of epipolar lines represents
01:10:10.840 --> 01:10:15.560
all points of the same epipolar plane.
01:10:16.140 --> 01:10:17.500
That's also a result.
01:10:19.400 --> 01:10:26.140
Yeah, so now, the essential matrix and the derivation that we did so
01:10:26.140 --> 01:10:33.900
far was always considering coordinates to represent all points QL and
01:10:33.900 --> 01:10:39.280
QR as three-dimensional points in the respective camera coordinate
01:10:39.280 --> 01:10:39.820
systems.
01:10:40.460 --> 01:10:42.220
Of course, this is not very useful.
01:10:42.520 --> 01:10:46.020
Typically, once we are given an image, we want to calculate in image
01:10:46.020 --> 01:10:46.460
coordinates.
01:10:47.440 --> 01:10:53.040
Now, can we change this equation a little bit so that we can enter
01:10:53.040 --> 01:10:58.660
image coordinates instead of coordinates in the camera coordinate
01:10:58.660 --> 01:10:59.120
systems?
01:10:59.300 --> 01:11:01.260
The answer is yes, and that's easy.
01:11:01.880 --> 01:11:08.880
So, when we have an image coordinate, U and V, given, what we can do
01:11:08.880 --> 01:11:10.940
is we can calculate a line of sight.
01:11:13.200 --> 01:11:18.720
And since we know that the virtual image plane has a distance of 1
01:11:18.720 --> 01:11:25.960
from the focal point of the camera, we can also intersect this line of
01:11:25.960 --> 01:11:27.540
sight with this virtual image plane.
01:11:27.980 --> 01:11:34.980
And by that, we can determine the coordinate QL or QR.
01:11:36.120 --> 01:11:40.360
That means it's always possible from a point in the image to calculate
01:11:40.360 --> 01:11:45.480
the QL and QR position in the image plane, this virtual image plane.
01:11:46.100 --> 01:11:47.660
So, it's a one-to-one mapping.
01:11:48.640 --> 01:11:50.240
And therefore, we can do it.
01:11:50.300 --> 01:11:55.080
We can change things and represent the same equation that we actually
01:11:55.080 --> 01:12:03.300
had, this epipolar condition equation, using not coordinates in
01:12:03.300 --> 01:12:07.620
the camera coordinate system, but coordinates in the image coordinate
01:12:07.620 --> 01:12:08.120
system.
01:12:08.320 --> 01:12:10.200
And then the equation looks like that.
01:12:10.920 --> 01:12:12.360
Actually, the same structure.
01:12:12.660 --> 01:12:16.320
We use the image coordinates UR, VR, and 1.
01:12:16.800 --> 01:12:22.540
And we use the image coordinates in the other image, UL, VL, and 1.
01:12:22.820 --> 01:12:28.400
And we have a matrix in between, also a three-by-three matrix, like E,
01:12:28.580 --> 01:12:29.520
now called F.
01:12:29.900 --> 01:12:31.760
F for fundamental matrix.
01:12:32.260 --> 01:12:37.280
And this matrix F can be calculated from E in this way, as it is shown
01:12:37.280 --> 01:12:43.340
here, with these matrices of intrinsic parameters, AL and AR for
01:12:43.340 --> 01:12:47.160
the two cameras, as given here.
01:12:48.040 --> 01:12:54.660
And by using this, we can use image coordinates instead of coordinates
01:12:54.660 --> 01:12:58.300
in this virtual image plane,
01:12:58.780 --> 01:13:03.500
so instead of these coordinates in the camera coordinate systems, to
01:13:03.500 --> 01:13:04.860
evaluate this equation.
01:13:05.400 --> 01:13:09.680
So, then it looks like that, and F is called the fundamental matrix.
01:13:09.860 --> 01:13:14.800
So, if we have the fundamental matrix, and these two matrices AL and
01:13:14.800 --> 01:13:19.200
AR, the intrinsics of the cameras, then we can calculate the essential
01:13:19.200 --> 01:13:20.600
matrix, and vice versa.
01:13:21.100 --> 01:13:25.100
If we have the essential matrix, and these two matrices of intrinsic
01:13:25.100 --> 01:13:28.500
parameters, we can derive the fundamental matrix from it.
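One common way to relate F and E, consistent with pixel coordinates u = A q, is F = A_R^{-T} E A_L^{-1}; whether the slides use exactly this arrangement of inverses and transposes is an assumption here. A small sketch with made-up intrinsics and extrinsics:

```python
import numpy as np

def skew(b):
    """Skew-symmetric matrix [b]_x with skew(b) @ v == np.cross(b, v)."""
    return np.array([[0.0,  -b[2],  b[1]],
                     [b[2],  0.0,  -b[0]],
                     [-b[1], b[0],  0.0]])

# hypothetical extrinsics (p_L = D @ p_R + b_L) and intrinsics
theta = 0.2
D = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0,           1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
b_L = np.array([0.3, 0.0, 0.05])
E = D.T @ skew(b_L)

A_L = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
A_R = np.array([[820.0, 0.0, 310.0], [0.0, 820.0, 250.0], [0.0, 0.0, 1.0]])

# with u = A @ q, the constraint q_R^T E q_L = 0 becomes u_R^T F u_L = 0
F = np.linalg.inv(A_R).T @ E @ np.linalg.inv(A_L)

# sanity check with one projected scene point
P_L = np.array([0.4, -0.2, 5.0])
P_R = D.T @ (P_L - b_L)
u_L = A_L @ (P_L / P_L[2])   # homogeneous pixel coordinates, left image
u_R = A_R @ (P_R / P_R[2])   # homogeneous pixel coordinates, right image
print(u_R @ F @ u_L)         # ~0 up to floating-point error
```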
01:13:30.240 --> 01:13:35.820
Okay, so, how do we determine them, typically?
01:13:36.160 --> 01:13:40.120
So, I already said, we assume that we have a calibration process.
01:13:40.280 --> 01:13:42.760
We calibrated the binocular camera system.
01:13:43.180 --> 01:13:46.260
This provides to us the intrinsic parameters, that means these
01:13:46.260 --> 01:13:50.580
matrices AL and AR, and it also provides to us the extrinsic
01:13:50.580 --> 01:13:54.340
parameters from which we can determine the essential matrix E.
01:13:54.660 --> 01:13:57.980
And by doing that, we determine F, the fundamental matrix.
01:13:58.720 --> 01:14:02.480
Of course, however, there's also another possibility.
01:14:03.200 --> 01:14:06.420
It's also possible, once we have an image pair, and we have
01:14:06.420 --> 01:14:11.180
corresponding points in these two images, there are also algorithms
01:14:11.180 --> 01:14:15.860
that are able to determine the fundamental matrix
01:14:16.640 --> 01:14:19.960
without having calibrated the cameras in advance.
01:14:20.160 --> 01:14:24.100
And these are called eight-point algorithms, because we need eight
01:14:24.100 --> 01:14:26.860
point correspondences to determine that.
01:14:27.600 --> 01:14:31.560
However, of course, if we have a setup like a binocular camera, we
01:14:31.560 --> 01:14:35.540
typically prefer to use a calibration process, an explicit calibration
01:14:35.540 --> 01:14:37.140
process in advance.
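A sketch of the eight-point idea, in the normalized variant: every correspondence gives one linear equation in the nine entries of F, and with at least eight correspondences F can be recovered up to scale as a null vector. All numeric values and the normalization step below are illustrative assumptions, not the exact algorithm from the lecture.

```python
import numpy as np

def skew(b):
    return np.array([[0.0,  -b[2],  b[1]],
                     [b[2],  0.0,  -b[0]],
                     [-b[1], b[0],  0.0]])

# synthetic ground truth, only used to generate test correspondences
theta = 0.1
D = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0,           1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
b_L = np.array([0.5, 0.1, 0.05])
A_L = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
A_R = A_L
E = D.T @ skew(b_L)
F_true = np.linalg.inv(A_R).T @ E @ np.linalg.inv(A_L)

rng = np.random.default_rng(0)
n = 20
P_L = np.vstack([rng.uniform(-1.0, 1.0, n),
                 rng.uniform(-1.0, 1.0, n),
                 rng.uniform(3.0, 6.0, n)])      # points in left camera coords
P_R = D.T @ (P_L - b_L[:, None])                # same points in right coords
u_L = A_L @ (P_L / P_L[2])                      # homogeneous pixel coords
u_R = A_R @ (P_R / P_R[2])

def normalize(u):
    """Hartley normalization: centroid to origin, mean distance sqrt(2)."""
    m = u[:2].mean(axis=1)
    d = np.sqrt(((u[:2] - m[:, None]) ** 2).sum(axis=0)).mean()
    s = np.sqrt(2.0) / d
    T = np.array([[s, 0.0, -s * m[0]], [0.0, s, -s * m[1]], [0.0, 0.0, 1.0]])
    return T @ u, T

uLn, T_L = normalize(u_L)
uRn, T_R = normalize(u_R)

# each correspondence gives one equation u_R^T F u_L = 0 in the entries of F;
# stack them and take the null vector via SVD
Amat = np.stack([np.kron(uRn[:, i], uLn[:, i]) for i in range(n)])
_, _, Vt = np.linalg.svd(Amat)
F_n = Vt[-1].reshape(3, 3)

# enforce rank 2 (a valid fundamental matrix has rank 2)
U, S, Vt2 = np.linalg.svd(F_n)
F_n = U @ np.diag([S[0], S[1], 0.0]) @ Vt2

F_est = T_R.T @ F_n @ T_L                       # undo the normalization
```

With exact correspondences the estimate matches the true F up to scale and sign; with noisy points the normalization step becomes important for numerical stability.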
01:14:38.860 --> 01:14:42.580
Okay, let's have a look at some example images to see what this
01:14:42.580 --> 01:14:45.380
epipolar geometry looks like.
01:14:45.680 --> 01:14:50.920
So, not artificial images, but images that I've recorded in
01:14:50.920 --> 01:14:55.840
my office, putting some books on a desk, and then moving the camera
01:14:55.840 --> 01:15:04.240
around and taking pairs of images and checking how the epipolar
01:15:04.240 --> 01:15:06.500
lines look.
01:15:07.160 --> 01:15:12.780
And this is the example of a case where I just shifted the camera in a
01:15:12.780 --> 01:15:13.980
horizontal direction.
01:15:14.980 --> 01:15:19.980
And what we can see is that then the epipolar lines that refer to some
01:15:19.980 --> 01:15:22.620
points are more or less horizontal.
01:15:26.440 --> 01:15:29.620
You might ask where is the epipole?
01:15:30.700 --> 01:15:34.340
The epipole is the point where all the epipolar lines intersect.
01:15:35.340 --> 01:15:38.900
And we don't see any point where all the lines intersect here.
01:15:40.120 --> 01:15:40.760
Where is it?
01:15:40.800 --> 01:15:44.620
It exists, but it's outside of this crop of the image that I have
01:15:44.620 --> 01:15:45.240
shown to you.
01:15:45.620 --> 01:15:48.200
It's far away, but it exists.
01:15:48.460 --> 01:15:53.300
There is a point where all the epipolar lines intersect, but it's very
01:15:53.300 --> 01:15:54.640
far on the side.
01:15:55.900 --> 01:16:00.640
It might also be at infinity and we might have parallel lines in an
01:16:00.640 --> 01:16:02.520
extreme case, or a very special case.
01:16:03.060 --> 01:16:05.540
But here in this case they are converging.
01:16:05.740 --> 01:16:09.860
There is one point of intersection, but it's outside of this area that
01:16:09.860 --> 01:16:10.380
we can see.
01:16:11.280 --> 01:16:14.340
It's part of the image plane, but not part of the image.
01:16:20.440 --> 01:16:22.100
So that's another case.
01:16:22.560 --> 01:16:25.960
Here the shift of the camera was more or less in a vertical direction,
01:16:26.060 --> 01:16:27.220
with a little bit of rotation.
01:16:28.080 --> 01:16:30.300
And now the situation looks like that.
01:16:30.640 --> 01:16:34.860
And of course we can imagine that the epipole is somewhere here and
01:16:34.860 --> 01:16:37.300
somewhere here on top of both images.
01:16:38.520 --> 01:16:38.780
Yeah?
01:16:39.340 --> 01:16:41.740
So that's a vertical shift.
01:16:42.200 --> 01:16:46.180
That's a combination of horizontal and vertical shift of the camera.
01:16:46.780 --> 01:16:48.640
Looks like that in this case.
01:16:49.220 --> 01:16:52.120
Somehow maybe interesting in this case.
01:16:52.720 --> 01:16:53.600
What happened here?
01:16:54.060 --> 01:17:01.480
Here I took a picture of the books from a closer point of view and
01:17:01.480 --> 01:17:05.080
from a point of view that is a bit farther away from the books.
01:17:05.780 --> 01:17:11.320
Here you can see the desk, the boundary of the desk.
01:17:11.460 --> 01:17:15.120
Here you don't see it because we are closer to the points.
01:17:15.680 --> 01:17:19.240
And here interestingly the epipoles are part of the image.
01:17:19.560 --> 01:17:23.060
As you can see somewhere more or less in the center of the image.
01:17:23.400 --> 01:17:27.180
Of course it's not possible to create a binocular camera system like
01:17:27.180 --> 01:17:30.840
that because one camera would occlude the other camera.
01:17:31.760 --> 01:17:35.420
Or one camera would occlude the scene for the other camera.
01:17:35.720 --> 01:17:42.060
But having a static scene and making two images one after the other of
01:17:42.060 --> 01:17:43.720
course is possible in this case.
01:17:44.760 --> 01:17:50.260
So these are typical images that show how the epipolar lines are
01:17:50.260 --> 01:17:54.000
arranged in such a camera setup.
01:17:55.060 --> 01:18:01.620
So now I think time is up and we will continue next Monday.