WEBVTT

00:09.000 --> 00:15.960
Okay, welcome to another session of the course Algorithms with

00:15.960 --> 00:17.020
Internet Applications.

00:17.240 --> 00:22.680
Today I would like to... well, first of all, I would like to mention a

00:22.680 --> 00:24.700
few things with respect to last week.

00:24.700 --> 00:31.780
Like last week, I couldn't teach here directly, because I was not

00:31.780 --> 00:32.380
here.

00:32.660 --> 00:36.140
I was at a meeting last week.

00:39.460 --> 00:43.080
And actually, it was a meeting having to do also with a topic that I'm

00:43.080 --> 00:45.040
just showing here.

00:45.620 --> 00:49.080
It was on e-mobility and things like that.

00:49.080 --> 00:55.580
So next week, there will be another week where I will not be here.

00:56.200 --> 00:59.820
And my assistant Mark Gurti cannot be here either.

01:00.640 --> 01:06.000
And so I think it would be reasonable to just reuse the recorded

01:06.000 --> 01:07.160
lecture of last year.

01:07.180 --> 01:10.060
Because most of you are anyway not sitting in this classroom.

01:10.060 --> 01:19.700
And so I will just prepare the lecture for next time, such that you

01:19.700 --> 01:23.260
can download the recorded lecture.

01:24.140 --> 01:28.580
And I hope that this is okay for you, those that are sitting here in

01:28.580 --> 01:29.400
the class.

01:30.560 --> 01:32.700
In two weeks, I will be back again.

01:33.500 --> 01:38.020
And after that, there are another two important meetings where I have

01:38.020 --> 01:38.560
to attend.

01:39.300 --> 01:45.860
And it is really... well, on some side, it is a bad thing, because the

01:45.860 --> 01:47.540
teaching suffers a bit.

01:47.700 --> 01:52.280
But I do record the lectures, and so it doesn't suffer that much.

01:52.900 --> 01:57.100
And the good thing certainly is that we are doing research that is

01:57.100 --> 01:59.880
considered to be important, so that we are involved in many projects.

01:59.880 --> 02:06.680
And you certainly also benefit from research that is at the forefront

02:06.680 --> 02:08.660
of everything.

02:09.600 --> 02:15.540
So I hope that you can... or that it is okay with you, if we do it

02:15.540 --> 02:15.860
like that.

02:15.920 --> 02:24.120
We will announce that on the news section of the lecture area in the

02:24.120 --> 02:26.100
teaching portal.

02:27.160 --> 02:28.880
Or the study portal.

02:29.380 --> 02:33.360
So that you know when we actually will have courses.

02:33.540 --> 02:37.400
And I will provide there a list of those dates where I will be there

02:37.400 --> 02:38.800
in any way.

02:39.040 --> 02:42.620
And just a few days where I just cannot teach the course.

02:42.800 --> 02:45.900
And I think it is the best thing to just reuse the recorded lectures

02:45.900 --> 02:47.140
of last year.

02:48.280 --> 02:52.200
Okay, so if you don't object to that, we will do it like that.

02:53.200 --> 02:57.920
So I just would like to start with some invitation to a special

02:57.920 --> 02:58.700
lecture.

02:59.140 --> 03:01.840
Like this is not on mobility systems and things like that.

03:01.940 --> 03:06.040
But you are students of information engineering, students of business

03:06.040 --> 03:06.580
engineering.

03:07.140 --> 03:12.640
And there is an interesting lecture on Thursday, this Thursday, by the

03:12.640 --> 03:14.520
KIT Focus Mobility Systems.

03:14.520 --> 03:21.660
Where you can listen to talks given by quite important persons.

03:21.800 --> 03:29.340
Dr. Thomas Weber, he is the CTO or CRO, you could say, Chief Research

03:29.340 --> 03:32.480
Officer of Daimler AG.

03:33.000 --> 03:37.300
Talking about innovation as a key for the future of mobility.

03:38.140 --> 03:42.940
And then there is another talk by Dr. Gutzmer.

03:43.810 --> 03:46.020
He is from the Schaeffler KG.

03:46.380 --> 03:50.760
I guess you know the Schaeffler KG has been in the media quite a bit.

03:51.200 --> 03:54.720
And he is talking about the requirements on engineers for tomorrow.

03:55.180 --> 03:57.480
Something which should be of interest to students.

03:58.200 --> 04:05.660
And then Frau Bier is telling something about teaching education at

04:05.660 --> 04:06.480
KIT.

04:06.800 --> 04:11.360
So what do we do to educate the engineers for tomorrow.

04:11.360 --> 04:17.500
And finally, Professor Guter Rang will give an overview on research

04:17.500 --> 04:22.400
within the focus on mobility systems.

04:22.600 --> 04:25.680
So you are welcome to go there, if you intend to go there.

04:25.780 --> 04:29.000
Like it's on Thursday afternoon in the Tula lecture hall.

04:29.560 --> 04:34.380
You should send an email to isabelzimmermann at kit.edu.

04:35.380 --> 04:39.820
And then you can not only attend the lecture, but also go to the

04:39.820 --> 04:40.760
reception afterwards.

04:41.300 --> 04:45.160
And get a glass of something to drink and get something to eat.

04:46.000 --> 04:49.080
But for that, it's necessary to know who is coming.

04:49.460 --> 04:52.440
So just this ad, I think it's an interesting talk.

04:54.780 --> 04:58.720
And for students of business engineering, but also for other students,

04:58.860 --> 05:01.040
it will be important to see something like that.

05:01.140 --> 05:03.060
Or it's a good chance to see something like that.

05:03.060 --> 05:09.360
And in mobility, it is very interesting to see that computer science,

05:09.460 --> 05:14.180
informatics, really has significantly increasing importance.

05:14.180 --> 05:18.140
Because you cannot do anything that is really new in that area without

05:18.140 --> 05:20.040
using methods of informatics.

05:21.020 --> 05:28.300
Okay, so this was just a very brief ad for that thing and for that

05:28.300 --> 05:28.780
event.

05:28.780 --> 05:34.140
And then I would like to come to our lecture.

05:34.980 --> 05:37.020
Now, what did we do last week?

05:37.020 --> 05:43.020
Last week we looked at several things.

05:44.540 --> 05:46.540
Oh, that's what happens here.

05:47.440 --> 05:58.920
Okay, so last time we looked at all the different possibilities.

05:59.540 --> 06:00.380
What did I do here?

06:01.740 --> 06:04.420
Something is not correct.

06:05.420 --> 06:06.680
This is strange.

06:07.300 --> 06:12.560
I have to briefly close this.

06:15.100 --> 06:17.640
And open it again.

06:26.690 --> 06:33.930
So, chapter 2, this was... that was the one.

06:40.510 --> 06:42.770
Here we are quite at the end.

06:46.320 --> 06:53.780
So, last time we had looked at the TCP header format or TCP protocol.

06:54.120 --> 06:59.940
We had looked at the different ways of, or the different tasks of TCP.

07:00.300 --> 07:04.740
I have shown you something about the way TCP operates.

07:05.840 --> 07:11.740
How the window size is controlled by looking at the...

07:12.780 --> 07:15.360
well, always observing the number of faults that occur.

07:15.520 --> 07:18.220
And if there are too many faults, then the window size is decreased.

07:18.760 --> 07:21.540
And as long as everything is okay, the window size is increased.

07:21.740 --> 07:24.760
Window size meaning the number of packets you can send before you wait

07:24.760 --> 07:25.460
for an acknowledgement.

07:26.430 --> 07:34.340
Then we looked at the UDP protocol that was the last slide of last...

07:34.340 --> 07:37.000
not last week, but two weeks ago.

07:38.640 --> 07:41.940
And so, what did we look at here?

07:42.500 --> 07:46.640
The UDP is the User Datagram Protocol, which is just extending the

07:46.640 --> 07:53.220
functionality of the IP layer to the application layer.

07:53.220 --> 07:58.720
That means that the application has to deal with the connection and

07:58.720 --> 08:01.840
also with the reliability of the communication.

08:02.160 --> 08:07.500
So, the application has to make sure that if it's necessary, all

08:07.500 --> 08:09.120
datagrams are actually received.

08:09.800 --> 08:13.160
And if there are some datagram losses, the application has to deal

08:13.160 --> 08:13.600
with that.

08:14.340 --> 08:18.520
We know that this is important if we look at certain applications like

08:18.520 --> 08:22.460
audio or video or audio transmission.

08:22.460 --> 08:25.500
Because in particular in audio transmission, you would like to have

08:25.500 --> 08:29.520
continuous flow of data at a specific data rate.

08:29.520 --> 08:31.960
Because you need a certain quality of your audio.

08:32.080 --> 08:36.000
And if there is a frame or some datagram missing, it's not that

08:36.000 --> 08:36.440
important.

08:36.620 --> 08:42.620
It's just a small flicker in the audio or in TV.

08:42.780 --> 08:45.100
It would be just a missing frame, which is not that important.

08:46.060 --> 08:50.180
So, for these applications, UDP is important.

08:50.180 --> 08:56.960
And since in UDP, you don't have to deal with all these things like

08:56.960 --> 08:58.460
acknowledgements and so on.

08:58.560 --> 09:00.220
The UDP header is much smaller.

09:01.020 --> 09:08.360
Sorry, it just has the very few entries which are necessary.

09:08.500 --> 09:11.900
You have to know the connection to the upper layer.

09:12.000 --> 09:13.780
The upper layer is the application layer.

09:13.900 --> 09:15.360
That means you need the source port.

09:15.460 --> 09:18.660
You need the destination port in order to know which applications are

09:18.660 --> 09:19.580
talking to each other.

09:19.580 --> 09:25.580
And you have to know something about the length of the datagram.

09:26.180 --> 09:28.500
Because you have to know how much data you have to look at.

09:29.060 --> 09:35.460
And then there is this checking for some transmission faults, whether

09:35.460 --> 09:41.900
the data is actually still okay and not corrupted by some transmission

09:41.900 --> 09:42.300
faults.

09:42.820 --> 09:43.800
Okay, so this is UDP.

09:44.520 --> 09:49.180
And if you are going to a totally different thing, I just wanted to

09:49.180 --> 09:52.980
briefly mention that there is certainly not only TCP IP.

09:53.180 --> 09:58.280
I said that the Internet is defined by using TCP IP, but there are

09:58.280 --> 09:59.600
also other protocols around.

09:59.740 --> 10:03.140
For example, the ATM asynchronous transfer mode was a completely

10:03.140 --> 10:07.220
different protocol, or is a completely different protocol, where you

10:07.220 --> 10:11.340
have a different philosophy behind sending data.

10:12.360 --> 10:18.580
It's also packet oriented, but it uses virtual channels.

10:18.960 --> 10:24.700
That means if you have several nodes, several possible paths, then

10:24.700 --> 10:31.880
like if this would be a path or that would be a path in ATM, you would

10:31.880 --> 10:35.280
decide for a specific path that has to be followed.

10:35.600 --> 10:40.560
That means you have a virtual channel and all the packets that are

10:40.560 --> 10:47.020
sent in the ATM protocol are sent over the same sequence of hops.

10:48.000 --> 10:49.680
The cell size is fixed.

10:50.220 --> 10:54.900
You have very small packets, just 5 bytes in the header and 48 bytes

10:54.900 --> 10:55.360
of data.

10:56.360 --> 11:03.640
And so there is not very much overhead for routing control.

11:04.340 --> 11:13.040
And then sometimes, so it is possible to send data over different

11:13.040 --> 11:16.180
virtual channels, then the order would not be guaranteed.

11:16.560 --> 11:20.960
Usually if you send something over the same channel, you don't have to

11:20.960 --> 11:28.040
deal with the sequence of reordering the datagrams, because normally

11:28.040 --> 11:31.100
they would be guaranteed in their order.

11:31.100 --> 11:36.720
The important point about ATM is that you can book certain service

11:36.720 --> 11:37.200
classes.

11:37.420 --> 11:39.460
You can book a connection.

11:39.720 --> 11:43.380
You can say, I would like to have a connection having constant

11:43.380 --> 11:43.900
bitrate.

11:44.040 --> 11:47.180
That means you have a certain application where you need a certain

11:47.180 --> 11:50.080
bitrate and this can be guaranteed by ATM.

11:50.700 --> 11:54.640
Then all the hops on the virtual channel have to guarantee that they

11:54.640 --> 11:57.900
can process data at a specific rate.

11:57.900 --> 12:03.740
And so this is the highest thing that you could like to get.

12:06.060 --> 12:07.920
Constant transmission rate.

12:08.680 --> 12:12.660
Then there is something which is not as strict, would be real-time

12:12.660 --> 12:14.220
variable bitrate.

12:14.500 --> 12:18.600
That means you have some tolerance with respect to the bitrate, but it

12:18.600 --> 12:25.160
should still be within certain time frames or time intervals, certain

12:25.160 --> 12:30.180
deadlines you should have in real time, the right frequency of your

12:30.180 --> 12:33.380
data transmissions.

12:34.180 --> 12:37.840
Then the next thing would be variable bitrate and non-real time.

12:37.960 --> 12:44.780
That means you don't have strict time deadlines for getting datagrams.

12:45.540 --> 12:48.160
The next thing would be available rate.

12:48.360 --> 12:52.560
That means if there are many connections using the same virtual

12:52.560 --> 12:57.400
channel or parts of the same virtual channel, then you would just like

12:57.400 --> 13:00.320
to get the available bitrate at all times.

13:01.360 --> 13:07.940
And so if somebody else has, for example, variable bitrate real time,

13:08.380 --> 13:11.600
and you are an additional user having available bitrate, then

13:11.600 --> 13:14.160
certainly sometimes you would get a very low bitrate.

13:14.760 --> 13:18.320
And the lowest level of quality would be unspecified bitrate.

13:18.720 --> 13:23.280
You just would like to transfer data, but there's no time restriction.

13:23.800 --> 13:26.800
You can just send it at some time where it is possible.

13:27.660 --> 13:34.080
And this is something like a scheduling thing in batch mode, for

13:34.080 --> 13:35.760
example, in job scheduling.

13:36.320 --> 13:37.780
You don't have time restrictions.

13:37.940 --> 13:41.240
You just want to get a certain job done, or in this case you want to

13:41.240 --> 13:45.120
get some information across, but you don't mind whether it is in

13:45.120 --> 13:48.480
between, sometimes not a very high rate.

13:48.480 --> 13:55.160
So if you have high quality restrictions, then you can ask for

13:55.160 --> 13:56.600
specific bitrates.

13:56.980 --> 14:02.180
This usually is not possible with TCP, as we know that we have the

14:02.180 --> 14:10.880
slow start protocol, where we have a continuous adaptation of the

14:10.880 --> 14:15.960
bitrate to the current situation in the network, and you don't get

14:15.960 --> 14:17.740
guaranteed transmission rates.

14:18.660 --> 14:25.280
So when ATM was established, it had something like 155 megabit per

14:25.280 --> 14:29.060
second, which was very fast, and initially people thought that this

14:29.060 --> 14:38.740
would actually be, or would take over from TCP IP, but the technology

14:38.740 --> 14:46.040
for TCP IP also improved, and so it is still there, but the switches,

14:46.040 --> 14:51.660
like the switches which were necessary for handling all this traffic,

14:52.100 --> 14:58.520
were quite expensive, and so the technology did not really get the

14:58.520 --> 15:01.120
momentum that was initially thought it would get.

15:01.840 --> 15:04.620
But this point here certainly is important.

15:04.960 --> 15:10.300
Nowadays we have very high bandwidth on transmission channels, and so

15:10.300 --> 15:15.200
even if we don't have these guaranteed bitrates, the effective

15:15.200 --> 15:21.500
transmission rate is quite good, and the protocols for using UDP for

15:21.500 --> 15:26.520
audio and video transmission are quite sophisticated meanwhile, so

15:26.520 --> 15:32.100
that you get appropriate transmission rates, even if you don't use

15:32.100 --> 15:36.080
these ATM service classes.

15:36.620 --> 15:40.600
Okay, I just wanted to mention that briefly, I didn't want to go into

15:40.600 --> 15:45.020
details of ATM above the level that I just presented to you.

15:45.740 --> 15:53.920
So it's time to look at different aspects of the Internet.

15:54.200 --> 15:56.900
We are talking here about algorithms for Internet applications.

15:57.320 --> 16:02.340
So far we looked at algorithms within the Internet, and so I would now

16:02.340 --> 16:08.120
like to come to the chapter which deals with one of the major

16:08.120 --> 16:12.760
applications of the Internet, which is searching for information.

16:13.620 --> 16:19.240
Like for most of us, the Internet is visible in the form of the World

16:19.240 --> 16:25.700
Wide Web, and there we have just such an enormous amount of

16:25.700 --> 16:29.280
information that it's necessary to be able to search for that

16:29.280 --> 16:33.460
information in a reasonable way, and also to advertise your

16:33.460 --> 16:34.840
information in a reasonable way.

16:35.440 --> 16:39.680
And the thing that you would really like to get is something like a

16:39.680 --> 16:44.460
challenge is phrased, or can be phrased in the following way.

16:49.380 --> 17:01.900
Every information should be at most two clicks away.

17:06.290 --> 17:08.930
Just think about that for a moment.

17:09.090 --> 17:13.150
You are looking for any information, and you would like to have a

17:13.150 --> 17:17.430
situation where you just have to perform at most two clicks to get the

17:17.430 --> 17:19.130
information that you are interested in.

17:20.810 --> 17:24.350
And you don't want to have a sequence of ten or something clicks

17:24.350 --> 17:28.430
before you get your information, but that means you need very

17:28.430 --> 17:33.690
effective ways of actually accessing the adequate information.

17:34.470 --> 17:37.590
Now I don't know why this all of a sudden switched to black.

17:37.590 --> 17:41.990
I have to switch the color back because I'm preferring the red.

17:42.090 --> 17:48.470
I don't know what you would prefer, but I'm used to using red as the

17:48.470 --> 17:49.330
annotation color.

17:49.930 --> 17:52.570
Now let's look at this topic, searching for information.

17:53.830 --> 17:56.070
I know that you have looked at that already.

17:56.270 --> 18:02.890
All of you are quite familiar with using search engines, using the

18:02.890 --> 18:03.610
World Wide Web.

18:03.610 --> 18:10.170
And so there are something like 300 million web searches per day.

18:10.790 --> 18:12.930
I should update that number.

18:13.250 --> 18:16.650
I think it will have increased meanwhile.

18:17.250 --> 18:19.070
I didn't write it this year.

18:19.170 --> 18:23.310
I wrote it last year or maybe the year before.

18:23.490 --> 18:25.590
I update these things now and then.

18:25.890 --> 18:30.790
But it's just obvious that there are many, many searches that are

18:30.790 --> 18:32.350
going on at the same time.

18:32.350 --> 18:36.610
And we know that quite a bit, for example, of energy is used just for

18:36.610 --> 18:38.350
performing all these searches.

18:39.030 --> 18:44.890
So it is definitely one of the most popular applications of the

18:44.890 --> 18:51.010
Internet after email, after all kinds of short messaging and so on.

18:51.630 --> 18:55.050
So we should look at what a search engine actually is.

18:55.190 --> 18:59.610
The search engine essentially has the task of retrieving information

18:59.610 --> 19:01.370
from some sources.

19:01.370 --> 19:04.790
So where is the information usually located?

19:05.870 --> 19:10.230
The information is in some textual documents.

19:10.470 --> 19:18.290
So we look for some text normally, whereas sometimes you can also now

19:18.290 --> 19:23.150
look for images, for audio, for video content.

19:23.450 --> 19:27.750
But normally what we look for is just for textual documents.

19:27.750 --> 19:33.430
And then in this information retrieval, like this is the huge world of

19:33.430 --> 19:34.550
all the data.

19:34.970 --> 19:37.690
And what you would like to have is some kind of index.

19:38.230 --> 19:43.230
As, for example, in a book you have a list of all the relevant items.

19:43.370 --> 19:49.470
And you can look up that list of items in your book in the index at

19:49.470 --> 19:49.850
the end.

19:49.850 --> 19:55.110
And there you have a pointer to the page where that item was actually

19:55.110 --> 20:02.590
explained or used or all the pages where that item actually is

20:02.590 --> 20:03.050
located.

20:03.290 --> 20:07.970
So you would like to be able to go to some index and this should lead

20:07.970 --> 20:12.670
you to a certain document where this information is relevant.

20:13.730 --> 20:15.890
So that's something like an index.

20:15.890 --> 20:20.130
You need some possibility to analyze the search phrase.

20:20.710 --> 20:26.310
You would like to formulate queries to a search engine and this should

20:26.310 --> 20:28.010
be analyzed in a reasonable way.

20:28.770 --> 20:33.050
And what you get should be something which actually corresponds to

20:33.050 --> 20:33.690
your intention.

20:34.250 --> 20:36.990
You have a certain question that you like to ask.

20:36.990 --> 20:39.790
You look for certain information and the information that you get,

20:40.370 --> 20:47.850
like this long list of results of a search engine query, that should

20:47.850 --> 20:51.850
really be organized in a way that the most important one is on the top

20:51.850 --> 20:54.750
and then some others are not as important.

20:54.930 --> 21:00.630
So the further you go down, the less important those documents

21:00.630 --> 21:02.370
normally will be.

21:02.370 --> 21:04.650
And you know that this is not always the case.

21:04.790 --> 21:08.670
Sometimes you have something on top here which is not really relevant

21:08.670 --> 21:11.730
but a paid entry.

21:12.270 --> 21:16.810
So some information providers just pay for being listed as the first

21:16.810 --> 21:19.810
entries as a result of search engine queries.

21:20.390 --> 21:21.730
Things like that are important.

21:21.830 --> 21:24.470
We have to look at how could we actually measure the relevance of

21:24.470 --> 21:25.330
retrieved documents.

21:25.890 --> 21:28.410
The standard task in information retrieval.

21:29.230 --> 21:33.010
Now there is a huge number of search engines.

21:34.410 --> 21:39.950
And although I guess that most of you only use one, who has ever used

21:39.950 --> 21:42.430
a search engine which has not been Googled?

21:44.130 --> 21:45.850
Okay, I think I asked that before.

21:46.890 --> 21:48.290
So, you did use.

21:48.390 --> 21:49.270
Which one did you use?

21:50.650 --> 21:50.850
Hmm?

21:51.110 --> 21:51.330
Yahoo.

21:51.850 --> 21:52.210
Okay.

21:53.270 --> 21:53.830
Other ones?

21:55.470 --> 21:55.990
Hmm?

21:55.990 --> 21:56.510
Bing.

21:56.770 --> 21:56.970
Okay.

21:56.970 --> 22:01.870
Okay, so there are lots of search engines.

22:02.450 --> 22:06.810
And it is interesting to know what those search engines can do.

22:07.050 --> 22:10.630
So there are certainly many, many sites which would give you

22:10.630 --> 22:12.010
information on search engines.

22:13.050 --> 22:17.030
But I don't know why I... what did I do here?

22:18.510 --> 22:19.410
This is interesting.

22:19.990 --> 22:22.730
It's also an interesting search engine, but I wanted to go to a

22:22.730 --> 22:23.370
different site.

22:23.370 --> 22:26.370
I don't know why I got to that one.

22:29.710 --> 22:30.230
Ah.

22:30.230 --> 22:31.210
This is funny.

22:31.970 --> 22:33.650
Here I have the MetaCrawler in there.

22:34.490 --> 22:38.450
I don't know how that happened that the link is not searchenginewatch

22:38.450 --> 22:40.170
.com but MetaCrawler.

22:40.410 --> 22:43.790
But MetaCrawler is an interesting example.

22:43.910 --> 22:45.350
I can briefly switch to that.

22:45.710 --> 22:48.590
Back to that and then get to the one I wanted to get at.

22:48.590 --> 22:52.690
So MetaCrawler is interesting because it combines all kinds of search

22:52.690 --> 22:53.050
engines.

22:53.590 --> 22:55.740
And if you ask a query here...

22:56.510 --> 23:00.750
So if I would look for whatever...

23:00.750 --> 23:02.110
What should I look for?

23:03.230 --> 23:03.750
Pardon?

23:04.750 --> 23:05.430
Algorithms.

23:05.570 --> 23:05.750
Okay.

23:07.170 --> 23:07.470
Oops.

23:08.850 --> 23:09.550
Algorithms.

23:09.550 --> 23:19.650
Then it would look at a large number of different...

23:21.330 --> 23:28.010
So this is results from Google, Yahoo, Bing, Ask, and some more.

23:29.110 --> 23:35.670
And so here you have a listing of those results which looks different

23:35.670 --> 23:38.290
from what you see at Google.

23:38.290 --> 23:41.470
But this is the combination of results from several search engines.

23:41.570 --> 23:44.750
I didn't want actually to show you this but I wanted to show you

23:44.750 --> 23:49.510
something having to do with searchenginewatch.com.

23:50.270 --> 23:53.070
So what is searchenginewatch.com?

23:53.370 --> 23:57.250
Searchenginewatch.com is an information site which is very, very

23:57.250 --> 23:57.810
valuable.

23:58.310 --> 24:02.790
Because there you get all kinds of information on search engines.

24:03.550 --> 24:06.510
Now it's, as you see here, it's also filled with ads.

24:06.510 --> 24:09.790
So because it's free information, at least to some extent.

24:10.430 --> 24:15.630
There's also some area where you only get as a member.

24:15.850 --> 24:19.990
You can subscribe to that service and get more in-depth information.

24:20.930 --> 24:26.090
And the interesting things are you get information, for example, on...

24:26.090 --> 24:30.150
Let me... like here on search engine marketing.

24:30.670 --> 24:31.190
Search.

24:31.490 --> 24:32.550
Why marketing?

24:32.550 --> 24:37.390
Now one of the biggest businesses nowadays is getting your information

24:37.390 --> 24:39.450
to as many people as possible.

24:40.370 --> 24:44.830
And so this is giving all kinds of information how you should design

24:44.830 --> 24:50.350
your webpages such that you get the highest relevance that is

24:50.350 --> 24:52.190
reasonable for your webpage.

24:52.350 --> 24:54.290
So this has to do with marketing.

24:55.030 --> 24:58.830
How can you actually exploit the mechanisms of search engines in order

24:58.830 --> 25:00.310
to get top ranks?

25:01.310 --> 25:05.190
So when we talk about search engine optimization, for example, one

25:05.190 --> 25:05.890
link here.

25:06.510 --> 25:13.350
Search engine optimization means you have to not optimize the search

25:13.350 --> 25:18.310
engine, but you have to optimize your documents with respect to the

25:18.310 --> 25:23.010
mechanisms of search engines such that you get high ranks for your

25:23.010 --> 25:26.090
information, such that you actually get found.

25:27.090 --> 25:32.190
Now, so there are all kinds of things that you could look at.

25:32.890 --> 25:40.230
Analytics and return on investment shows you how you can configure out

25:40.230 --> 25:42.530
all kinds of things.

25:42.650 --> 25:45.950
I would like to just show you briefly other things here.

25:46.550 --> 25:50.550
There are areas like ratings and statistics.

25:50.550 --> 25:53.690
And these ratings are quite interesting.

25:55.550 --> 25:59.450
It doesn't make sense to put all this in my slides, but because this

25:59.450 --> 26:01.930
is every year, it is updated stuff.

26:02.690 --> 26:11.790
So, for example, the top search providers of information in the United

26:11.790 --> 26:14.890
States here in September 2010.

26:15.630 --> 26:21.590
Here is a statistic of what the market share is of the different

26:21.590 --> 26:27.590
search engines with respect to searches that have been performed in

26:27.590 --> 26:28.550
the United States.

26:29.530 --> 26:35.250
So you see that Google has a market share of 65.4 percent.

26:36.050 --> 26:37.390
No, 66.1.

26:37.450 --> 26:38.390
It decreased a bit.

26:38.590 --> 26:41.550
Yahoo dropped a bit to 16.7.

26:41.550 --> 26:45.650
Microsoft, that is Bing certainly, has 11.2.

26:45.850 --> 26:49.750
It increased over several months quite significantly because Bing, for

26:49.750 --> 26:51.730
example, was a rather new search engine.

26:52.470 --> 26:55.550
But so now it's something like 11 percent.

26:56.150 --> 26:56.410
Ask.

26:56.910 --> 26:58.310
Who of you has ever used Ask?

26:59.330 --> 27:00.750
Okay, so you know something.

27:01.530 --> 27:01.750
Pardon?

27:02.610 --> 27:02.770
Pardon?

27:04.410 --> 27:05.490
Yes, I know.

27:05.490 --> 27:10.830
They had a very good service, and I really liked that, but it's not

27:10.830 --> 27:14.650
really that much use.

27:14.870 --> 27:17.610
And then we have the AOL still has 2.3.

27:17.730 --> 27:21.070
This will be slightly different in different countries, but this is

27:21.070 --> 27:24.210
the information from the United States.

27:25.110 --> 27:28.030
And so you get all kinds of statistics.

27:28.890 --> 27:32.970
For example, something like the top U.S.

27:33.030 --> 27:33.810
search terms.

27:33.810 --> 27:39.530
What were the search terms that have been looked at the most in July

27:39.530 --> 27:40.370
2010?

27:42.010 --> 27:48.950
And so this is in the States with categories IT and Internet.

27:49.270 --> 27:53.450
There you have here these many things like to go for PayPal.

27:54.450 --> 28:01.050
Automotive manufacturers Ford had marginally on the front with 1.79.

28:01.550 --> 28:04.930
Search volume just a bit ahead of Toyota.

28:05.470 --> 28:07.510
You see all the different makes here.

28:08.710 --> 28:13.570
And then you have all the different categories and the search terms

28:13.570 --> 28:14.770
that people actually look for.

28:16.450 --> 28:18.050
So this is information.

28:18.170 --> 28:19.090
Why is this important?

28:19.090 --> 28:25.750
Because if you know what people look for, then you can optimize your

28:25.750 --> 28:30.870
pages such that terms that other people might be interested in are

28:30.870 --> 28:33.090
also on your pages.

28:34.850 --> 28:39.350
So this is a way if you talk about search engine optimization,

28:39.510 --> 28:40.730
optimizing your pages.

28:41.130 --> 28:44.930
It's important to know what are the terms that people are the most

28:44.930 --> 28:45.750
interested in.

28:45.750 --> 28:49.270
Because then you would like to put some of that information on your

28:49.270 --> 28:52.870
pages in order to get the people to your pages.

28:54.050 --> 28:56.910
And so this has to do with marketing.

28:57.670 --> 29:04.770
And if we look at different rating, for example, in the UK, here we

29:04.770 --> 29:13.370
have the top search terms in the UK, for example, in April 2010.

29:14.370 --> 29:19.810
And you will see that those terms are slightly different here in the

29:19.810 --> 29:20.230
UK.

29:20.990 --> 29:28.490
Audi is the top automotive brand here.

29:28.830 --> 29:32.450
And BMW and Mercedes-Benz and so on.

29:32.650 --> 29:40.190
And Renault and Volkswagen, which did not appear on the US term

29:40.190 --> 29:40.610
listing.

29:40.610 --> 29:44.830
So I just wanted to show you there are lots of information available

29:44.830 --> 29:49.070
on this website searchenginewatch.com.

29:51.050 --> 29:55.270
And I don't want to go too much into these details here.

29:55.270 --> 30:03.530
But if you are getting a task later on sometime, maybe in your job to

30:03.530 --> 30:11.770
deal with optimizing the way your company is available on the World

30:11.770 --> 30:15.070
Wide Web, then this is very valuable information.

30:15.270 --> 30:21.190
And then certainly you can also look into the more sophisticated

30:21.190 --> 30:23.770
information that is only available to the members.

30:23.770 --> 30:27.510
Okay, so I would like to get back to my slides.

30:28.090 --> 30:35.250
So this is all the brief description of the searchenginewatch.com and

30:35.250 --> 30:37.630
a source for interesting information.

30:37.830 --> 30:41.010
Now what is search engines or what do we talk about when we talk about

30:41.010 --> 30:41.730
search engines?

30:41.730 --> 30:48.730
Search engines are pieces of software that have to visit websites on

30:48.730 --> 30:52.590
the Internet, on the World Wide Web, in order to create catalogs of

30:52.590 --> 30:53.170
web pages.

30:54.170 --> 30:59.770
So we have to in some way find out what kind of information is

30:59.770 --> 31:00.170
available.

31:00.730 --> 31:04.510
So all the sites that are providing information have to be visited by

31:04.510 --> 31:05.190
a search engine.

31:05.190 --> 31:10.670
And you know that there are lots of such sites available, so quite an

31:10.670 --> 31:16.150
enormous task to visit all the sites that provide information on a

31:16.150 --> 31:17.390
regular basis.

31:18.010 --> 31:19.930
They have to be updated constantly.

31:20.170 --> 31:23.390
We will come back to the requirements of that a bit later.

31:23.610 --> 31:27.170
So if we have to update the information, it means we have to revisit

31:27.170 --> 31:29.310
all the different pages again and again.

31:29.910 --> 31:33.370
And it's interesting to see how that can be done.

31:33.370 --> 31:39.090
Certainly it has to be possible to do that without human interference.

31:39.450 --> 31:46.270
It has to be done in an autonomous way because it's just such a large

31:46.270 --> 31:50.830
number that it's impossible for a human to look at all this

31:50.830 --> 31:51.930
information directly.

31:52.570 --> 31:59.410
Nevertheless, there are directories which are built with human

31:59.410 --> 31:59.910
interference.

31:59.910 --> 32:01.770
And why is that necessary?

32:01.990 --> 32:09.950
It's necessary because we would like to have certain quality levels in

32:09.950 --> 32:13.950
the information that is displayed in these directories.

32:14.250 --> 32:18.950
So these directories are manually created catalogs of web pages.

32:19.770 --> 32:23.750
And that means the sites have to be submitted.

32:23.750 --> 32:27.730
If you would like to get onto a directory with your information, you

32:27.730 --> 32:30.970
have to submit your web page and then it is rated.

32:31.990 --> 32:37.470
And then you are hopefully assigned to the appropriate category.

32:37.470 --> 32:43.110
And if you are listed there, people know this is higher quality

32:43.110 --> 32:47.370
information than the information that is just in catalogs where you

32:47.370 --> 32:52.650
have not manually but automatically created content.

32:54.090 --> 32:58.290
So Yahoo is a directory.

32:58.450 --> 32:59.670
Initially it was a directory.

32:59.970 --> 33:03.450
Now it is extended with a search engine having also automatically

33:03.450 --> 33:04.430
created content.

33:05.290 --> 33:11.070
So Yahoo is a hybrid search engine, but you have manually classified

33:11.070 --> 33:15.190
content and you have automatically created extensions of that.

33:16.210 --> 33:19.090
So these are the differences.

33:19.690 --> 33:23.970
Search engines provide automatically catalogs and then you have these

33:23.970 --> 33:26.750
directories of manually created information.

33:27.570 --> 33:32.610
Now let's look at the way the search engines actually work.

33:32.690 --> 33:34.530
How can they get all that information?

33:35.270 --> 33:42.410
You need some tool for accessing the information on the web.

33:42.870 --> 33:50.270
So you have all different sites in the network where information is

33:50.270 --> 33:51.170
located.

33:51.170 --> 33:54.270
And now you have to visit all these sites.

33:54.590 --> 34:00.910
So this is done by a tool called a spider or a crawler, which is

34:00.910 --> 34:05.690
visually or virtually crawling from one website to the other.

34:06.350 --> 34:10.090
And the way it is crawling is determined by, for example, the links

34:10.090 --> 34:12.390
that are found in the web pages.

34:12.870 --> 34:17.190
Or there may be some other systematic ways of actually visiting all

34:17.190 --> 34:19.990
the possible nodes in the web.

34:19.990 --> 34:26.370
So the spider visits the web page, reads and analyzes it, and then

34:26.370 --> 34:28.490
follows links to other pages within the site.

34:28.570 --> 34:32.510
Now when I say it is visiting all these different sites, what it's

34:32.510 --> 34:36.430
actually doing certainly is it's sitting on some server and then

34:36.430 --> 34:39.590
accessing a website somewhere, a document.

34:40.170 --> 34:45.310
It gets that information, it analyzes the information on the server

34:45.310 --> 34:48.390
where it's sitting, and then it's accessing the next page.

34:48.390 --> 34:54.390
So we have not software that is going to all these different sites and

34:54.390 --> 34:55.490
being executed there.

34:55.930 --> 34:59.950
But it is like as a spider would do, it would just walk over all these

34:59.950 --> 35:00.430
nodes.

35:00.630 --> 35:04.910
But it is retrieving information, looking at the information, and then

35:04.910 --> 35:08.490
asking for the next information from other nodes.

35:09.250 --> 35:15.110
So it follows normally the links within those pages to other pages

35:15.110 --> 35:17.530
within the site and also to other sites.

35:18.030 --> 35:21.890
And in that way it systematically looks at all the information that is

35:21.890 --> 35:25.090
on these websites.

35:26.210 --> 35:29.250
And then it has to return to the site on a regular basis.

35:29.670 --> 35:33.290
Normally there are fixed intervals for revisiting a web page or

35:33.290 --> 35:33.790
website.

35:33.790 --> 35:36.930
Maybe something like every three weeks or four weeks.

35:37.710 --> 35:42.410
And some spiders would do that in an adaptive way.

35:42.550 --> 35:48.050
If they notice changes, they will visit a page more often than a page

35:48.050 --> 35:51.670
that has not changed over several visits.

35:51.790 --> 35:56.890
Then the visit interval will be made a bit longer.

35:57.880 --> 35:58.230
Okay.

35:58.830 --> 36:00.150
Then we have the index.

36:00.310 --> 36:04.210
The information that is retrieved has to be put in some index or the

36:04.210 --> 36:06.650
catalog of the search engine.

36:07.590 --> 36:10.710
So I mentioned that already.

36:10.930 --> 36:18.910
It is just like some giant book where you can just go or find the

36:18.910 --> 36:24.390
documents for certain terms that are mentioned in documents in the

36:24.390 --> 36:24.630
web.

36:24.630 --> 36:29.970
And this index has to be updated whenever web page changes are

36:29.970 --> 36:30.370
detected.

36:30.670 --> 36:33.870
And as we all know, web pages are changing quite a bit.

36:34.690 --> 36:40.690
And so this is something which is a tremendous task to have an index

36:40.690 --> 36:44.650
always updated and as current as possible.

36:46.730 --> 36:51.830
Now certainly only the pages listed in the index are available for

36:51.830 --> 36:52.250
searching.

36:52.250 --> 36:57.490
So if you put up a page on your website and this website has never

36:57.490 --> 37:00.750
been visited by a search engine, then this information will not be

37:00.750 --> 37:05.470
visible as a result to a query performed by some search engine.

37:05.930 --> 37:11.190
You must make sure that your web pages actually get indexed by some

37:11.190 --> 37:11.950
search engine.

37:12.110 --> 37:16.590
You can do that either by submitting or by being visited in an

37:16.590 --> 37:18.630
automatic way by some search engine.

37:19.470 --> 37:22.390
And certainly you need very efficient data.

37:22.990 --> 37:26.490
We know or I don't know whether you have heard, but probably you have

37:26.490 --> 37:34.310
heard that the search engine providers like Google, for example, they

37:34.310 --> 37:44.410
operate huge data centers which are just enormously large structures

37:44.410 --> 37:48.870
to support searching such that you get information back in within very

37:48.870 --> 37:50.610
few seconds, within milliseconds.

37:51.390 --> 37:58.470
And just to show, for example, in Seattle they built a large search

37:58.470 --> 38:05.670
engine farm of servers that would answer queries.

38:06.610 --> 38:12.010
And this service installation, this data center, has about one third

38:12.010 --> 38:14.850
of the energy consumption of the whole city of Seattle.

38:15.610 --> 38:21.830
So it's a huge amount of energy that is necessary for that.

38:22.390 --> 38:26.690
And this is a topic which is also very interesting to look at how we

38:26.690 --> 38:30.430
can actually minimize the energy that is necessary to access

38:30.430 --> 38:31.070
information.

38:31.430 --> 38:34.750
But this is a topic which I don't address in this course.

38:35.410 --> 38:40.590
And then we have the search engine software around the spider and the

38:40.590 --> 38:41.030
index.

38:41.310 --> 38:43.330
How can we actually utilize the index?

38:44.070 --> 38:48.490
We, or you as a user, formulate a query.

38:48.710 --> 38:50.750
This has to be analyzed in an appropriate way.

38:50.970 --> 38:57.690
And then the information that is looked for has to be detected.

38:57.950 --> 39:01.490
So the index is searched for documents matching the query or for links

39:01.490 --> 39:02.230
to documents.

39:02.750 --> 39:07.090
And then these documents have to be ranked in order of relevance.

39:07.790 --> 39:11.410
This is another difficult task.

39:11.450 --> 39:12.450
How can we do that?

39:12.910 --> 39:18.330
Find out what the most relevant answer to a query actually is.

39:19.210 --> 39:24.370
In order to do that, we usually have several measures for measuring

39:24.370 --> 39:26.810
the quality of information retrieval.

39:27.870 --> 39:30.610
These are measures which are quite old.

39:31.210 --> 39:34.970
They have been introduced for information retrieval.

39:34.970 --> 39:39.430
Two standard measures are recall and precision.

39:40.550 --> 39:46.270
And they address the following quality criteria.

39:46.970 --> 39:54.550
So if we ask a certain thing, we formulate or we submit a query to a

39:54.550 --> 39:56.730
search engine, we get certain results.

39:57.430 --> 40:02.850
And now if you get a certain list of documents or links to documents

40:02.850 --> 40:07.030
as a result to a query, and we know that sometimes you get documents

40:07.030 --> 40:08.230
which are not really relevant.

40:08.790 --> 40:12.450
Now let's assume that we can actually make a clear binary distinction

40:12.450 --> 40:14.690
between relevant and not relevant.

40:15.210 --> 40:18.930
And then if we look at all the possible documents, this is here all

40:18.930 --> 40:22.670
the possible documents that we can retrieve at all.

40:22.670 --> 40:27.150
There we have this binary decision on relevant or not relevant.

40:28.030 --> 40:33.230
Normally, we have something like relevant indicator is a numerical

40:33.230 --> 40:33.790
value.

40:34.390 --> 40:37.730
But you could say I have a threshold, and if this numerical value is

40:37.730 --> 40:39.850
above a certain threshold, I say this is relevant.

40:40.010 --> 40:41.910
If it's below that, it's not relevant.

40:42.510 --> 40:42.930
It's irrelevant.

40:43.650 --> 40:47.090
And now you get a certain set of retrieved documents.

40:47.830 --> 40:51.370
And in this set of retrieved documents, you might have some which are

40:51.370 --> 40:51.830
irrelevant.

40:52.410 --> 40:55.130
You certainly don't want to get irrelevant documents.

40:55.990 --> 41:00.710
So if the percentage of irrelevant documents is too high, the quality

41:00.710 --> 41:03.350
of your query result is bad.

41:04.190 --> 41:09.530
Then you certainly want to get, if possible, all the relevant

41:09.530 --> 41:10.030
documents.

41:10.470 --> 41:14.570
But as you can see here, you only get a certain subset of all the

41:14.570 --> 41:15.290
relevant documents.

41:16.030 --> 41:19.990
And so these are the two things that are looked at.

41:20.450 --> 41:25.290
The recall actually measures how many of the relevant documents are

41:25.290 --> 41:27.930
actually delivered as a result to a query.

41:28.450 --> 41:34.330
So you look at the number of relevant retrieved documents with respect

41:34.330 --> 41:36.770
to the total number of all relevant documents.

41:37.070 --> 41:39.710
So that is exactly what I just mentioned.

41:39.710 --> 41:45.310
You look at what is the percentage of the retrieved relevant documents

41:45.310 --> 41:48.090
with respect to all those that are actually relevant.

41:48.750 --> 41:50.590
And the other measure is the precision.

41:51.510 --> 42:02.290
If you look at your query result, you look at how many relevant or how

42:02.290 --> 42:05.290
many of the retrieved documents actually have been relevant.

42:05.290 --> 42:10.030
So you look at the percentage of retrieved relevant documents within

42:10.030 --> 42:17.030
your result or response to the query.

42:18.250 --> 42:22.770
So precision means how many bad documents did I get which are not

42:22.770 --> 42:23.270
relevant.

42:24.090 --> 42:28.730
Recall means how many of the really good documents did I actually get.

42:29.610 --> 42:34.510
Now, if you try to provide a search engine service, you certainly

42:34.510 --> 42:38.890
would like to provide high quality with respect to precision and

42:38.890 --> 42:39.390
recall.

42:40.270 --> 42:44.810
Now you could easily always make sure that your recall is perfect.

42:45.790 --> 42:49.730
Recall means what is the number of relevant retrieved documents with

42:49.730 --> 42:52.150
respect to the number of all relevant documents.

42:52.430 --> 42:54.790
You just deliver all the documents.

42:55.510 --> 43:00.530
If you deliver all the documents, you definitely have provided all

43:00.530 --> 43:02.010
relevant documents.

43:02.410 --> 43:07.450
Because they are all in that list of documents, but this is not of no

43:07.450 --> 43:14.630
benefit to the person who actually asked this query, because you don't

43:14.630 --> 43:17.030
have an indication which is relevant, which is not relevant.

43:18.330 --> 43:20.310
Or you have to... it's just too large.

43:20.910 --> 43:24.310
And if you would like to have something which is always very precise,

43:24.910 --> 43:32.530
you just look for one or very few relevant documents in order to make

43:32.530 --> 43:35.170
sure that you don't provide irrelevant documents.

43:35.450 --> 43:37.310
Then you would have a very high precision.

43:38.950 --> 43:44.010
You don't provide irrelevant documents, but as long as the recall is

43:44.010 --> 43:47.570
small, it's also not what the user actually wants.

43:47.570 --> 43:52.850
And so the trade-off is, if you'd like to provide a large number of

43:52.850 --> 44:00.590
documents, you might be in a situation that you might also provide

44:00.590 --> 44:01.870
some irrelevant documents.

44:02.750 --> 44:08.590
And so you have to make sure that recall and precision are both quite

44:08.590 --> 44:14.030
high, but it's very difficult to have something which actually gives

44:14.030 --> 44:15.610
you a high value for both.

44:17.570 --> 44:20.790
The question is, what actually makes a document relevant for a query?

44:21.510 --> 44:23.510
What are the criteria for that?

44:24.250 --> 44:27.910
And so if you ask for something like algorithms, what we just looked

44:27.910 --> 44:32.490
at, you would like to get documents where algorithms actually appear.

44:33.090 --> 44:37.450
Now, how can you see whether a name or a term in the query actually

44:37.450 --> 44:38.110
appears?

44:38.590 --> 44:44.250
That means what we just had there on this response to the query we

44:44.250 --> 44:46.150
sent to MetaCrawler.

44:46.150 --> 44:49.490
In the titles, we had algorithms.

44:50.730 --> 44:56.070
Obviously, this looks like a list of relevant responses.

44:56.910 --> 45:04.590
So the first thing is, your term actually is present in the document.

45:04.790 --> 45:06.250
So you have a match in that document.

45:06.530 --> 45:08.210
So it occurs somewhere.

45:08.550 --> 45:09.890
You have your query term.

45:09.890 --> 45:17.250
Now, in the response we saw, the results actually were all in the

45:17.250 --> 45:20.310
title, or the word algorithms appeared in the title.

45:21.170 --> 45:25.010
And definitely, if the title of your document contains the word you're

45:25.010 --> 45:29.310
looking for, then you can assume that it is quite relevant.

45:30.530 --> 45:38.490
Then, if a query term appears very often, if this term appears many

45:38.490 --> 45:42.670
times in a document, then you can assume, well, this document will

45:42.670 --> 45:46.550
deal with exactly that term I'm interested in, and then it should be

45:46.550 --> 45:47.170
very relevant.

45:48.350 --> 45:52.850
Now, I just mentioned before that it's important to know what are the

45:52.850 --> 45:56.310
search terms that people use the most often.

45:56.730 --> 46:00.470
So if you try to optimize your webpage, and you just put in those

46:00.470 --> 46:05.790
terms into your webpage, then the user who gets this page as a

46:05.790 --> 46:10.570
response would see, well, there are those terms somewhere, but I don't

46:10.570 --> 46:13.250
see anything which is relevant with respect to my search.

46:14.210 --> 46:20.070
So this is something which is also a topic to be dealt with.

46:20.530 --> 46:22.610
What about misuse of those things?

46:22.770 --> 46:27.410
But initially, assuming that everybody is doing something which is

46:27.410 --> 46:36.070
adequate, then the number of query terms with matches is important.

46:36.210 --> 46:42.250
If you have all the terms in your query, if they all match, and not

46:42.250 --> 46:48.410
just one, the document should be more relevant than if you have only

46:48.410 --> 46:49.830
one term which matches.

46:50.410 --> 46:56.030
Then you have the number of matches, so I explained this in a slightly

46:56.030 --> 46:57.370
different way.

46:57.370 --> 47:00.630
So the first thing here was the number of query terms with matches.

47:01.170 --> 47:04.270
The next thing is the number of matches of these query terms within

47:04.270 --> 47:04.830
your document.

47:05.690 --> 47:09.370
So if you have several terms in your query, do they all match?

47:09.790 --> 47:11.310
Or does only one of them match?

47:12.030 --> 47:16.490
And if you have matches, how often do these matches occur?

47:16.930 --> 47:18.410
And where do they occur?

47:18.690 --> 47:21.790
If they are just somewhere at the end of the document, maybe the

47:21.790 --> 47:26.610
document is not that relevant as if they would appear directly in the

47:26.610 --> 47:33.050
title, or in the abstract, or at least in the initial phrases of the

47:33.050 --> 47:33.410
document.

47:33.690 --> 47:37.050
Because you as a user, you would click on the link that you get as a

47:37.050 --> 47:40.350
response to a query, and you would briefly look into the document.

47:40.550 --> 47:44.630
And if you don't see the term that you are interested in, you would

47:44.630 --> 47:46.290
assume that it's not that relevant.

47:47.170 --> 47:53.470
So the distance from the start is an important topic.

47:53.790 --> 47:56.450
Title, definitely zero distance from the start.

47:57.390 --> 48:01.490
And if you have something in the conclusion, then it is quite far from

48:01.490 --> 48:01.950
the start.

48:04.550 --> 48:10.430
So if you know that these are criteria for looking at the quality, or

48:10.430 --> 48:16.490
at the relevance of a document for a query, you know that if you want

48:16.490 --> 48:23.110
to make sure that people get access or get your pages as a result to a

48:23.110 --> 48:26.410
query, you know you have to put all the relevant phrases that are

48:26.410 --> 48:30.550
relevant for you, for your business, quite far to the front.

48:31.630 --> 48:34.330
And then you have something like the quality of matches.

48:34.510 --> 48:38.550
Maybe you have not a complete match, but just a partial match.

48:38.650 --> 48:41.950
Your word does not appear completely, but only part of that word.

48:41.950 --> 48:48.570
Which means your search engine must actually detect that, or must be

48:48.570 --> 48:50.930
able to look also for partial matches.

48:51.550 --> 48:53.030
Many search engines don't do that.

48:53.130 --> 48:54.790
They just look for complete matches.

48:56.770 --> 49:00.210
And then you might have even more information.

49:00.730 --> 49:02.790
You might have so-called meta tags.

49:03.810 --> 49:10.890
Let me just briefly go to that, for example, here to that page.

49:10.890 --> 49:17.050
And if we look at the... yes, exactly this here.

49:17.730 --> 49:20.770
Here we have certain meta information.

49:20.970 --> 49:23.430
You know that you can always provide meta information via

49:23.430 --> 49:23.970
descriptions.

49:24.630 --> 49:29.650
Then you have here content type, and you have something on language.

49:31.110 --> 49:35.230
Well, all kinds of things, which are...

49:35.230 --> 49:39.330
I don't see what I wanted to see here.

49:39.330 --> 49:42.750
Let me look at, for example...

49:42.750 --> 49:47.330
I don't know what we have on our page of our institute as meta terms,

49:47.570 --> 49:48.890
but there should be some.

49:55.290 --> 49:56.890
So, here...

49:58.970 --> 50:01.130
All kinds of meta information.

50:02.290 --> 50:05.810
And there should be meta information.

50:06.670 --> 50:11.150
I should know about this, because I should know what we put into our

50:11.150 --> 50:15.970
own portal, but normally I don't look at it in this way.

50:19.830 --> 50:23.590
So, there's something like content shortcuts.

50:24.590 --> 50:26.210
This is funny.

50:26.330 --> 50:30.310
I don't find the really important information.

50:31.670 --> 50:39.470
So, let me just briefly look to another interesting site, which I

50:39.470 --> 50:41.470
always looked at.

50:44.190 --> 50:45.570
It's an important site.

50:45.670 --> 50:46.930
It's the Gesellschaft der Informatik.

50:46.990 --> 50:49.330
Who of you is a member of the Gesellschaft der Informatik?

50:50.350 --> 50:51.230
Why not?

50:51.510 --> 50:55.030
It's an important association of professionals in computer science.

50:56.890 --> 51:00.270
So, students have a very low membership fee.

51:03.690 --> 51:07.830
Now, I just wanted to look at the...

51:07.830 --> 51:10.350
They all changed quite a bit.

51:11.290 --> 51:14.210
So, you see, we have here all kinds of things.

51:14.290 --> 51:16.650
Here we have styles, style information.

51:17.370 --> 51:24.290
And I don't see the information that I really wanted to show you.

51:24.690 --> 51:29.010
The meta information on keywords and things like that.

51:29.330 --> 51:31.590
I don't want to spend more time on that.

51:32.450 --> 51:34.230
But the information...

51:34.230 --> 51:42.750
The important point is that you can provide explicit information on

51:42.750 --> 51:46.930
relevance in the meta tags of a website.

51:47.430 --> 51:51.350
You can give information on keywords and content and so on.

51:51.810 --> 51:55.100
And search engines would look at that information.

51:55.650 --> 52:02.810
So, even if you have certain topics in your document, and you say,

52:02.930 --> 52:08.230
okay, there are certain phrases, certain keywords that should direct

52:08.230 --> 52:09.270
people to this page.

52:09.270 --> 52:14.110
Then you can list those keywords in the meta tag section, which is in

52:14.110 --> 52:19.090
front of most of the webpages.

52:20.690 --> 52:24.690
And then, certainly, the statements on relevance are not unique.

52:25.130 --> 52:27.910
It depends on the actual...

52:27.910 --> 52:29.470
Yeah, okay.

52:29.690 --> 52:35.030
It depends on the method of actually evaluating the relevance of your

52:35.030 --> 52:35.470
document.

52:36.230 --> 52:39.830
And we will come back to measuring relevance a bit later.

52:41.250 --> 52:46.350
I just want to briefly look at those topics and then come back to

52:46.350 --> 52:46.850
these things.

52:47.390 --> 52:48.690
Now, what is a query, actually?

52:48.970 --> 52:56.210
So, a query is a specification of information that you would like to

52:56.210 --> 52:56.930
access.

52:57.510 --> 53:02.550
So, you specify a certain string, or word, or term, or several words,

53:02.550 --> 53:05.470
which you would like to search for.

53:05.930 --> 53:08.590
And you can do that in different ways, as you know.

53:09.230 --> 53:16.070
For example, here you could search for all documents containing

53:16.070 --> 53:19.670
Université Karlsruhe, or Algorithmus, or whatever.

53:22.050 --> 53:31.570
So, this 2.35 terms, I didn't update that this year, but this will

53:31.570 --> 53:32.430
also change.

53:34.450 --> 53:38.110
Normally, most of you would just put in one word, maybe.

53:38.310 --> 53:42.190
Now we have these automatic extensions, and so the number of words

53:42.190 --> 53:43.770
might get larger.

53:44.150 --> 53:47.770
But usually you would not put in very long queries, but only rather

53:47.770 --> 53:48.430
small ones.

53:48.870 --> 53:54.150
But you can do something like searching for all terms in a string.

53:54.650 --> 53:58.010
Then you would only like to get documents which actually contain

53:58.010 --> 53:59.630
matches for all the terms.

53:59.630 --> 54:03.670
Or you say at least one of the terms should match, then you have the

54:03.670 --> 54:04.710
OR operation.

54:06.750 --> 54:12.470
And if you search, for example, what are we doing if we are searching

54:12.470 --> 54:14.330
for a document?

54:15.170 --> 54:18.750
You could say the document, like the way you actually retrieve

54:18.750 --> 54:19.410
information.

54:20.330 --> 54:22.430
What do you do if you are searching for documents?

54:23.310 --> 54:27.050
You could have the first assumption here that the document is a

54:27.050 --> 54:29.510
sequence of symbols over some alphabet.

54:30.910 --> 54:36.970
And if there is a match for such a query, like this is your document,

54:38.910 --> 54:45.770
and this document somewhere, or you could say, let me try it a

54:45.770 --> 54:48.950
different way, a document, usually we would indicate a document like

54:48.950 --> 54:52.470
this, but you could also certainly just transform that in a sequence

54:52.470 --> 54:53.190
of characters.

54:53.850 --> 54:57.250
And if in that sequence of characters the word you are looking for is

54:57.250 --> 54:58.930
occurring, then you have a match.

54:59.310 --> 55:04.910
And that means, if there is a match for a query A in the document,

55:05.190 --> 55:09.290
then this document is of the following form, that you have any symbol

55:09.290 --> 55:14.050
of your alphabet, and then the term you are looking for, and then

55:14.050 --> 55:19.390
another sequence of symbols from your alphabet.

55:20.430 --> 55:24.930
And so this is a description of a certain regular language actually.

55:25.950 --> 55:33.430
You are defining a language of all the elements of that language are

55:33.430 --> 55:39.810
hits to your query, all those words containing Universität Karlsruhe.

55:40.270 --> 55:44.390
So the query essentially corresponds to a regular expression which

55:44.390 --> 55:47.410
describes regular language.

55:48.790 --> 55:53.630
And if you have a regular expression, you know that you can define an

55:53.630 --> 56:00.730
appropriate finite automaton to perform that pattern matching task.

56:00.730 --> 56:05.470
So you could just check whether a certain document is an element of

56:05.470 --> 56:09.550
your language that you are interested in.

56:10.370 --> 56:16.330
So in that way, you would have to actually transform your query into

56:16.330 --> 56:20.690
an appropriate finite automaton, and then you could execute your

56:20.690 --> 56:21.070
query.

56:21.410 --> 56:26.630
That means you would take a query, transform it into an automaton, and

56:26.630 --> 56:31.730
look for the occurrence of this word within a long list of documents.

56:32.550 --> 56:36.990
But that would mean you would look for the documents, or you would

56:36.990 --> 56:39.810
search through the documents as a result of a query.

56:40.870 --> 56:48.630
So for example, if you have search for a certain phrase, not just one

56:48.630 --> 56:54.490
term, but sometimes you would, for example, if I would like to look

56:54.490 --> 57:02.150
for something which is specification of a string, if I would put that

57:02.150 --> 57:07.730
into quotes, put that as a query, you would look for the occurrence of

57:07.730 --> 57:10.850
exactly that sequence of symbols.

57:11.150 --> 57:12.150
You know that.

57:13.210 --> 57:19.210
The other assumption is that the documents are all represented by an

57:19.210 --> 57:25.670
index, and so this is some kind of data structure allowing for

57:25.670 --> 57:29.210
efficient retrieval, and so you have already processed all your

57:29.210 --> 57:34.330
documents, and then you have this index, and in order to answer such a

57:34.330 --> 57:41.290
query, you have to look at that index, and there you get information

57:41.290 --> 57:45.550
which documents actually contain a certain term that is within your

57:45.550 --> 57:45.830
query.

57:46.690 --> 57:54.390
So this here would be doing a search online in a set of documents, and

57:54.390 --> 58:00.230
this is just some pattern matching within documents, and this here

58:00.230 --> 58:04.990
would be pre-processing all the documents, and then searching only by

58:04.990 --> 58:10.530
looking into the index and retrieving the documents via an index.

58:12.290 --> 58:17.510
Okay, let us briefly look at these two different things.

58:17.510 --> 58:24.590
So I would like to mention both because we can only find information,

58:24.950 --> 58:30.750
or we can only perform full-text searches if we know how to actually

58:30.750 --> 58:33.690
formulate such a search, how to formulate that we look for the

58:33.690 --> 58:37.570
occurrence of certain patterns within a document.

58:37.850 --> 58:42.190
And so, just to remind you, regular expressions, you know what that

58:42.190 --> 58:42.550
is.

58:42.550 --> 58:48.970
We have, like this is something which is in the basic courses, you

58:48.970 --> 58:53.470
have certain base elements, and then you have expressions like alpha

58:53.470 --> 59:00.010
plus beta, alpha times beta, or alpha iterated, which are all regular

59:00.010 --> 59:06.410
expressions, and now you can simplify that, and for example, state

59:06.410 --> 59:11.650
that you look for the occurrence, like whether a certain term occurs

59:11.650 --> 59:18.550
or not, that means lambda, or empty set, or empty word, or alpha, you

59:18.550 --> 59:23.730
could say you look for the occurrence of some symbol between a i and a

59:23.730 --> 59:30.590
j, so the second thing here is a j, not an i, that means you assume

59:30.590 --> 59:34.270
you have an ordered set of symbols, or you could say you're looking

59:34.270 --> 59:41.450
for a certain term which is occurring at most k times, or at least k

59:41.450 --> 59:48.810
times, so these would be different ways of actually phrasing a query.

59:48.970 --> 59:52.970
So you could, if you would like to phrase a query in a more

59:52.970 --> 59:57.450
sophisticated way, using regular expressions, you could phrase a query

59:57.450 --> 01:00:05.270
like this, which would be looking for certain documents, starting with

01:00:05.270 --> 01:00:11.910
some symbols, and then you have either a capital or a small a, and

01:00:11.910 --> 01:00:18.710
then you have a sequence of letters, and then you have either s, or

01:00:18.710 --> 01:00:26.350
us, or en, or ic, or isch, so if you would look for algorithms,

01:00:26.630 --> 01:00:32.630
algorithmus, algorithmen, algorithmic, algorithmisch, and so on, and

01:00:32.630 --> 01:00:37.850
then you would have here another sequence of letters, and you would

01:00:37.850 --> 01:00:40.510
like to see that at least five times.

01:00:40.510 --> 01:00:45.610
So you look for all documents containing at least five times certain

01:00:45.610 --> 01:00:47.550
variants of the word algorithmus.

01:00:47.590 --> 01:00:52.650
Now, you will not find this type of interface in most search engines,

01:00:52.790 --> 01:00:55.850
but you could do something like that, and this would be just a

01:00:55.850 --> 01:01:02.010
specification of a certain finite automaton, which would exactly

01:01:02.010 --> 01:01:04.370
perform such a search as a pattern matcher.

01:01:04.710 --> 01:01:07.810
We will come to that topic of pattern matching in a moment.

01:01:07.810 --> 01:01:14.590
So this would allow for very specific searches, but I don't think that

01:01:14.590 --> 01:01:17.430
this would be adequate for people to perform that.

01:01:17.710 --> 01:01:21.830
If you look at the first interfaces, user interfaces, that search

01:01:21.830 --> 01:01:25.470
engines provided for formulating a query, you actually had

01:01:25.470 --> 01:01:29.070
sophisticated ways of phrasing your query with Boolean operations,

01:01:29.850 --> 01:01:31.910
and, or, and, not, and things like that.

01:01:31.910 --> 01:01:38.670
These are still available as extended search, or extended ways of

01:01:38.670 --> 01:01:43.210
phrasing a query, but the normal way of phrasing a query is just

01:01:43.210 --> 01:01:47.030
writing down a sequence of terms, and then you have standard ways of

01:01:47.030 --> 01:01:47.910
interpreting that.

01:01:48.950 --> 01:01:54.730
So, normally, these regular expressions are at most suitable as purely

01:01:54.730 --> 01:02:00.110
internal formats, but you could specify a very general search like

01:02:00.110 --> 01:02:00.510
this.

01:02:00.510 --> 01:02:03.210
Just to mention how one could do that.

01:02:03.350 --> 01:02:10.030
This is the typical way how you would specify some queries in

01:02:10.030 --> 01:02:11.310
information retrieval.

01:02:12.310 --> 01:02:18.730
So usually, we have simple specifications of queries, like just a list

01:02:18.730 --> 01:02:23.710
of words, and then you can sometimes choose whether you would like to

01:02:23.710 --> 01:02:28.410
have conjunctive or disjunctive search, and some more extensions if

01:02:28.410 --> 01:02:31.950
you would like to look for certain specific types of documents, like

01:02:31.950 --> 01:02:36.270
only look at links, or only at titles, or look only for specific

01:02:36.270 --> 01:02:38.350
languages, and things like that.

01:02:39.010 --> 01:02:41.690
And on this page here, I just show you some examples of search

01:02:41.690 --> 01:02:42.070
queries.

01:02:44.130 --> 01:02:47.630
I did not repeat this, this year I will do that again.

01:02:48.170 --> 01:02:50.970
The only thing I want to show you here is the following.

01:02:51.690 --> 01:02:59.730
Certainly, there is always some dynamic in the search engine, that

01:02:59.730 --> 01:03:08.790
means if you search for a topic, for a certain information, at

01:03:08.790 --> 01:03:10.550
different times you will get different results.

01:03:11.990 --> 01:03:16.210
So here, for example, for that search query here, for Angewandte or

01:03:16.210 --> 01:03:19.630
Informatik or Universität or Karlsruhe, so that means a disjunctive

01:03:19.630 --> 01:03:29.990
search for these terms, you would get at Google 97 million pages.

01:03:30.890 --> 01:03:33.770
No, this is...

01:03:37.650 --> 01:03:43.720
this was at different...

01:03:43.720 --> 01:03:45.200
what did I do here?

01:03:50.040 --> 01:03:52.540
I did not look...

01:03:54.580 --> 01:03:57.340
I looked at a different search engine.

01:03:58.780 --> 01:03:59.800
It doesn't matter.

01:03:59.940 --> 01:04:01.860
These are two different search engines.

01:04:01.980 --> 01:04:06.440
I think I looked at AltaVista and then also compared that to Google.

01:04:07.360 --> 01:04:12.680
And it's just the fact that you have very different numbers of search

01:04:12.680 --> 01:04:13.600
results.

01:04:14.460 --> 01:04:21.320
So, if you look for all documents containing one of those words, it's

01:04:21.320 --> 01:04:25.240
obvious that you will get a very large number of results here.

01:04:26.100 --> 01:04:30.500
Because the word Angewandte or Informatik or Universität or Karlsruhe

01:04:30.500 --> 01:04:31.780
will appear in many documents.

01:04:32.000 --> 01:04:37.900
If you look for the conjunctive search, where all those terms occur,

01:04:38.380 --> 01:04:43.480
then you definitely get a much smaller number, like 493,000, and at

01:04:43.480 --> 01:04:45.340
Google something like 83,000.

01:04:45.980 --> 01:04:49.100
So, a much smaller number of documents.

01:04:50.600 --> 01:04:57.680
And here, the same search engine that provided 493,000 in 2008

01:04:57.680 --> 01:05:01.560
provided 268,000 in 2007.

01:05:02.200 --> 01:05:03.840
So the number of results increases.

01:05:04.400 --> 01:05:09.060
But nobody of us, actually, whatever you, look at all these pages.

01:05:11.100 --> 01:05:15.360
So, is it actually necessary to have that many results?

01:05:15.480 --> 01:05:20.640
It just indicates there's a huge amount of information and you get

01:05:20.640 --> 01:05:23.380
displayed the top 10 or top 20 or something.

01:05:24.740 --> 01:05:29.060
And then you can also say, well, I can specify, I don't want the word

01:05:29.060 --> 01:05:32.480
mathematics in there, just Angewandte, Informatik, Karlsruhe, you get

01:05:32.480 --> 01:05:34.380
another different number.

01:05:34.380 --> 01:05:38.800
Or, for example, this is funny.

01:05:39.760 --> 01:05:40.760
No, it's not that.

01:05:41.080 --> 01:05:41.900
Yeah, it is funny.

01:05:43.300 --> 01:05:52.220
So, here we look for Angewandte, Informatik, Karlsruhe as a phrase,

01:05:52.360 --> 01:05:53.060
phrase search.

01:05:54.060 --> 01:05:59.040
And here, at Google, I got 47,700 results.

01:05:59.360 --> 01:06:04.040
But before, immediately before, I did a search for Angewandte and

01:06:04.040 --> 01:06:06.300
Informatik and Karlsruhe.

01:06:06.840 --> 01:06:12.240
Well, it said, and not Mathematik, so it's not that strange.

01:06:12.400 --> 01:06:14.100
But it was a smaller number.

01:06:15.540 --> 01:06:18.700
Yeah, it is strange.

01:06:19.380 --> 01:06:25.760
38,000 results for looking for documents containing all these terms,

01:06:25.920 --> 01:06:26.700
but not mathematics.

01:06:27.920 --> 01:06:33.440
And here we look for the phrase Angewandte, Informatik, Karlsruhe.

01:06:33.580 --> 01:06:36.480
Exactly those symbols in this sequence.

01:06:37.460 --> 01:06:43.480
And there were only 467 pages at, I think it was Alta Vista, and at

01:06:43.480 --> 01:06:45.060
Google it was 47,000.

01:06:46.320 --> 01:06:47.540
Very strange.

01:06:48.840 --> 01:06:53.420
And I don't believe that all those documents that were listed here had

01:06:53.420 --> 01:06:56.560
Angewandte, Informatik, Karlsruhe in exactly that sequence.

01:06:57.860 --> 01:07:00.120
So it is sometimes the results are a bit strange.

01:07:02.840 --> 01:07:08.480
And here, another thing, in 2007 it was 739 pages.

01:07:09.020 --> 01:07:11.900
And I don't believe that the number actually dropped.

01:07:12.280 --> 01:07:14.600
Normally the number of pages would be increased.

01:07:15.220 --> 01:07:17.940
So the results are not always reliable.

01:07:18.560 --> 01:07:23.260
It depends also where you are actually phrasing the query, whether you

01:07:23.260 --> 01:07:26.880
are currently in Europe, or in the States, or in Asia, you get a

01:07:26.880 --> 01:07:29.240
totally different number of results.

01:07:29.340 --> 01:07:33.980
So the only thing I wanted to mention here, or to indicate, is that

01:07:33.980 --> 01:07:40.400
you get indicators for the number of pages that you get, but sometimes

01:07:40.400 --> 01:07:45.980
they are not really intuitive if you compare the different numbers.

01:07:46.180 --> 01:07:51.100
Sometimes you would expect that the number of hits would drop, and in

01:07:51.100 --> 01:07:52.380
effect it's increased.

01:07:54.180 --> 01:08:02.420
Then you would like to look for numbers which are similar to or you

01:08:02.420 --> 01:08:07.880
would look to search for documents which contain terms which are

01:08:07.880 --> 01:08:09.660
similar to the words you have in your query.

01:08:11.000 --> 01:08:16.220
So the question is, when would we say two words or two query terms are

01:08:16.220 --> 01:08:16.560
similar?

01:08:17.320 --> 01:08:22.240
So we could look for something which is here, Hausaufgabe, Hausarbeit,

01:08:22.700 --> 01:08:28.800
Hausaufgabe, definitely very similar to Hausaufgabe without these

01:08:28.800 --> 01:08:29.760
hyphens there.

01:08:31.080 --> 01:08:35.260
Heimarbeit or Stadt, Städte, Stadt, maybe you wouldn't look for that

01:08:35.260 --> 01:08:36.740
one, but it looks similar.

01:08:37.440 --> 01:08:42.340
Or if you look for Karlsruhe, all kinds of misspellings of Karlsruhe,

01:08:42.700 --> 01:08:44.240
you would say these are very similar.

01:08:44.240 --> 01:08:48.040
So if you look for Karlsruhe, you would like to get also links

01:08:48.040 --> 01:08:51.940
containing these similar terms.

01:08:52.920 --> 01:08:56.600
So we have to look at the problem, how do we actually measure the

01:08:56.600 --> 01:08:57.840
similarity of words?

01:08:58.640 --> 01:09:00.620
There are many different measures around.

01:09:02.040 --> 01:09:06.180
One typical measure would be to look at the Hemming distance, which is

01:09:06.180 --> 01:09:12.940
just the number of positions where U and V differ, so the two words U

01:09:12.940 --> 01:09:22.480
and V, and to just compare the entries at these different places here,

01:09:22.740 --> 01:09:28.060
or no, the same position, you compare the same position in the two

01:09:28.060 --> 01:09:34.180
words, and whenever you have a mismatch, you just count that.

01:09:35.000 --> 01:09:39.720
So the Hemming distance is the number of positions where the two

01:09:39.720 --> 01:09:41.300
strings differ.

01:09:42.120 --> 01:09:45.700
So if you look at these examples that we had up here, if we look at

01:09:45.700 --> 01:09:51.200
Hausarbeit and Heimarbeit, the distance is three because only those

01:09:51.200 --> 01:09:56.660
three symbols here have a mismatch, the others all coincide.

01:09:57.600 --> 01:10:04.100
And if we look at Karlsruhe and Karlsruher, then here the distance is

01:10:04.100 --> 01:10:06.680
seven, so very far from each other.

01:10:06.820 --> 01:10:07.140
Why?

01:10:07.840 --> 01:10:12.080
Because if you look at, like here, the R is missing.

01:10:13.680 --> 01:10:18.740
And so this here, these parts are considered to be completely

01:10:18.740 --> 01:10:27.040
different because at the positions, the letters are just distinct, and

01:10:27.040 --> 01:10:30.580
you don't see that there's just one letter missing.

01:10:31.560 --> 01:10:35.620
But one letter missing is a typical typo, and you should be able to

01:10:35.620 --> 01:10:38.480
detect that this is just an editing fault.

01:10:38.680 --> 01:10:42.800
And so we have another distance, which is the editing distance, which

01:10:42.800 --> 01:10:47.200
is actually looking at the number of editing operations needed to

01:10:47.200 --> 01:10:51.400
transform this one string U into the string V.

01:10:51.400 --> 01:10:55.540
And so the editing operations definitely can be different types of

01:10:55.540 --> 01:10:55.900
things.

01:10:56.060 --> 01:10:59.100
You could delete a certain symbol, as I did here.

01:10:59.360 --> 01:11:03.380
If I delete the R, then I get to that sequence of symbols.

01:11:03.780 --> 01:11:09.900
Or I could insert some symbol, like here I have the R inserted at the

01:11:09.900 --> 01:11:10.160
end.

01:11:11.340 --> 01:11:13.220
Or I could replace something.

01:11:14.040 --> 01:11:19.460
So another thing that could be done, so for example here, the D is

01:11:19.460 --> 01:11:22.280
replaced with a T, and this is just one operation away.

01:11:23.500 --> 01:11:25.060
Replace, or switch.

01:11:26.080 --> 01:11:31.260
Switching is done, for example, if we...

01:11:32.300 --> 01:11:32.940
here.

01:11:33.720 --> 01:11:35.340
There, switching has been done.

01:11:36.020 --> 01:11:37.240
R, L, and L, R.

01:11:37.400 --> 01:11:39.200
Typical mistake there to do.

01:11:39.820 --> 01:11:43.340
So these are typical editing operations.

01:11:44.100 --> 01:11:49.680
And now, certainly the distance will depend on the type of editing

01:11:49.680 --> 01:11:52.040
operations that you actually consider.

01:11:53.260 --> 01:11:57.120
So maybe you only consider delete, insert.

01:11:58.300 --> 01:12:04.000
Then you have a different distance, as if you would say, also allow

01:12:04.000 --> 01:12:05.840
replace or switch.

01:12:06.460 --> 01:12:10.620
Because replace would be delete, and then insert.

01:12:11.720 --> 01:12:13.680
But replace could be one operation.

01:12:14.520 --> 01:12:17.960
So it depends on what kinds of operations you allow, and then you can

01:12:17.960 --> 01:12:19.480
compute the distance.

01:12:20.040 --> 01:12:25.240
Here, for example, for Hausarbeit and Heimarbeit, again, the distance

01:12:25.240 --> 01:12:26.280
would be 3.

01:12:26.480 --> 01:12:32.480
If we have delete, insert, replace, we would just replace A, U, and S

01:12:32.480 --> 01:12:33.580
with E, I, and M.

01:12:34.000 --> 01:12:36.940
But this is the same distance as with respect to the Hamming distance.

01:12:36.940 --> 01:12:41.680
The important thing here is the distance between Karlsruhe and

01:12:41.680 --> 01:12:42.420
Karlsruhe.

01:12:42.820 --> 01:12:47.640
There you would have one delete at that position, and one insert at

01:12:47.640 --> 01:12:49.740
that position, and you get to Karlsruhe.

01:12:50.760 --> 01:12:52.800
So there the editing distance would be 2.

01:12:52.800 --> 01:13:00.920
And we have a different way of measuring the distance, the similarity

01:13:00.920 --> 01:13:05.420
of two words, and the editing distance definitely is closer to what we

01:13:05.420 --> 01:13:11.640
would think is a similar word.

01:13:11.760 --> 01:13:16.060
So this is more close to the intuitive mission of similarity.

01:13:17.320 --> 01:13:21.620
But certainly the complexity of determining those distances are

01:13:21.620 --> 01:13:22.220
different.

01:13:22.920 --> 01:13:27.040
If we look at that, like if we look at the Hamming distance, if you

01:13:27.040 --> 01:13:32.740
just have to scan two words and check for all the positions where you

01:13:32.740 --> 01:13:35.620
have a mismatch, this is linear time, no problem.

01:13:36.300 --> 01:13:39.860
If you go to the editing distance, it is more difficult.

01:13:39.860 --> 01:13:43.880
First of all, it depends on the number of operations.

01:13:44.880 --> 01:13:49.800
And then you have to look how you can actually rearrange those things.

01:13:49.900 --> 01:13:53.940
You can have arbitrary positions where you can do insertions,

01:13:54.040 --> 01:13:55.080
deletions, and so on.

01:13:55.900 --> 01:14:00.940
And you can easily characterize the editing distance, so you could

01:14:00.940 --> 01:14:04.040
simply get estimates.

01:14:04.040 --> 01:14:09.000
You could say the editing distance definitely is always larger or

01:14:09.000 --> 01:14:11.600
equal to the difference in length.

01:14:12.040 --> 01:14:17.320
So if one here is much longer, certainly you have to insert all those

01:14:17.320 --> 01:14:20.980
characters in order to get the word.

01:14:21.200 --> 01:14:25.300
Like if this is U and you would like to look at how you can get V from

01:14:25.300 --> 01:14:28.760
U, you would need that many insert operations.

01:14:28.760 --> 01:14:34.140
But you could also say you are looking at the longest common

01:14:34.140 --> 01:14:35.820
subsequence of U and V.

01:14:36.480 --> 01:14:38.840
What is the longest common subsequence?

01:14:39.920 --> 01:14:45.520
The longest common subsequence would be just a sequence of symbols

01:14:45.520 --> 01:14:53.680
that are in that word and that occur also in this word V.

01:14:54.570 --> 01:15:02.940
So the indicated areas here need not necessarily actually be the same

01:15:02.940 --> 01:15:09.220
number of parts, just that the sequence of symbols that you have in W

01:15:09.220 --> 01:15:16.520
is the longest sequence of U that is occurring in V.

01:15:16.920 --> 01:15:21.180
Not necessarily in adjacent positions, maybe there are some other

01:15:21.180 --> 01:15:25.520
symbols in between, but the longest common subsequence can be

01:15:25.520 --> 01:15:30.020
determined definitely, which is not that simple, but it can be

01:15:30.020 --> 01:15:30.360
determined.

01:15:32.160 --> 01:15:37.900
And then you can just look through these two documents.

01:15:38.340 --> 01:15:41.660
That is not difficult to determine, the longest common subsequence.

01:15:42.220 --> 01:15:49.160
And then you can just determine or derive the exact value of the

01:15:49.160 --> 01:15:50.080
editing distance.

01:15:50.920 --> 01:16:02.640
You just have to look at the size of U, the size of V, and then you

01:16:02.640 --> 01:16:10.040
just, so what do you do if you add those two sizes, then you have all

01:16:10.040 --> 01:16:15.800
the symbols in U and V, and now the distance corresponds to all those

01:16:15.800 --> 01:16:20.980
symbols that you have to modify in order to get from one word to the

01:16:20.980 --> 01:16:21.240
other.

01:16:21.860 --> 01:16:29.920
That means you just have to subtract the number of symbols in that

01:16:29.920 --> 01:16:35.280
longest common subsequence, because you don't have to modify those.

01:16:36.580 --> 01:16:45.360
And so you have to delete this number of symbols, subtract this number

01:16:45.360 --> 01:16:51.120
for W, which is occurring as part of that sequence, and also as part

01:16:51.120 --> 01:16:51.880
of that word.

01:16:52.420 --> 01:16:57.120
That's why you have to subtract it twice, and then you have the value

01:16:57.120 --> 01:16:58.080
for the editing distance.

01:16:58.200 --> 01:17:00.740
But still you have to determine the longest common subsequence.

01:17:01.320 --> 01:17:04.240
It's not possible in linear time, it is a different algorithm.

01:17:04.620 --> 01:17:08.160
Here I say proof is an exercise, like to prove that this is the

01:17:08.160 --> 01:17:13.940
correct estimation is simple, but the common subsequence is defined

01:17:13.940 --> 01:17:19.000
here in a way that corresponds to what I just indicated here.

01:17:19.700 --> 01:17:30.400
The common subsequence is just W1 to Wk, where this W1 to Wk occurs

01:17:30.400 --> 01:17:38.700
both in U and in V, and it is interspersed with some other subwords in

01:17:38.700 --> 01:17:42.340
U and in V, which are not in the other ones.

01:17:43.720 --> 01:17:47.660
So this is just a simple way of describing the longest common

01:17:47.660 --> 01:17:53.180
subsequence, and what we then could do certainly is to look, for

01:17:53.180 --> 01:17:57.840
example, for all words, such that the editing distance is rather

01:17:57.840 --> 01:18:01.820
small, maybe less or equal to 2, so you allow for two editing

01:18:01.820 --> 01:18:05.440
operations, and look at all the words that you could get with two

01:18:05.440 --> 01:18:08.760
editing operations on the word that you are just looking for.

01:18:09.680 --> 01:18:13.800
Now I would like to show you how we can actually determine this

01:18:13.800 --> 01:18:21.280
editing distance, and so Lievenstein designed a specific algorithm for

01:18:21.280 --> 01:18:24.220
that, that's why it's sometimes called also the Lievenstein distance,

01:18:24.760 --> 01:18:27.660
but it is just the edit distance as I defined it.

01:18:27.660 --> 01:18:32.900
And the algorithm that is used here is essentially a dynamic

01:18:32.900 --> 01:18:37.480
programming approach, that means we look at the simple problems, and

01:18:37.480 --> 01:18:44.340
then for simple problems of determining the editing distance, and then

01:18:44.340 --> 01:18:47.340
go to larger, more complex problems.

01:18:47.680 --> 01:18:54.080
And the idea is to look how we can transform one word, S, into a word,

01:18:54.240 --> 01:19:01.800
T, so we have here, for example, a word S, which is some word, and we

01:19:01.800 --> 01:19:12.240
have a word T, another word, and then we would like to determine how

01:19:12.240 --> 01:19:18.840
we can actually transfer or transform S into T by editing this word S

01:19:18.840 --> 01:19:20.900
using the editing operations that are allowed.

01:19:21.900 --> 01:19:31.420
And we do that in a gradual way, we just determine values Dij, and Dij

01:19:31.420 --> 01:19:37.580
is the number of operations needed to transform the initial part of S

01:19:37.580 --> 01:19:40.000
into another initial part of T.

01:19:40.000 --> 01:19:49.760
So we look at the position i here, S1 to Si, and we look at T1 to Tj,

01:19:50.900 --> 01:19:57.600
and we just look at how can we transform S1 to Si into T1 to Tj.

01:19:57.600 --> 01:20:03.460
And if we have all these distances, then definitely we can at the end,

01:20:04.100 --> 01:20:16.740
look at what is the distance D N M, and in that, no, D, sorry, not N

01:20:16.740 --> 01:20:27.320
M, M N, so the distance M N would be the distance, the editing

01:20:27.320 --> 01:20:30.120
distance of S and T as a whole.

01:20:31.260 --> 01:20:32.280
Now how can we do that?

01:20:32.680 --> 01:20:37.360
You see a lot of symbols there, which you definitely will not

01:20:37.360 --> 01:20:38.880
understand immediately.

01:20:39.060 --> 01:20:40.220
So what is visible here?

01:20:41.600 --> 01:20:43.960
We do it in a dynamic programming approach.

01:20:43.960 --> 01:20:47.920
Dynamic programming always means that we have to build a certain

01:20:47.920 --> 01:20:48.360
table.

01:20:49.500 --> 01:21:01.260
And in that table, we will have to list S1, S2, S3, and so on.

01:21:01.860 --> 01:21:09.460
And here we list T1, T2, T3, and so on.

01:21:10.480 --> 01:21:16.300
And then here, essentially, we write down the empty word lambda or

01:21:16.300 --> 01:21:16.840
epsilon.

01:21:18.360 --> 01:21:24.720
And so this would mean, here we would indicate, what is the, like this

01:21:24.720 --> 01:21:29.200
would actually start with 0, distance 0 and something.

01:21:29.880 --> 01:21:34.500
Now if you have an empty string, how can you get to, like this is

01:21:34.500 --> 01:21:37.340
operation, number of operations, lambda to lambda is 0.

01:21:37.340 --> 01:21:40.320
Lambda to T1 would be one operation.

01:21:41.300 --> 01:21:44.420
To get T1 and T2 would be two operations.

01:21:46.120 --> 01:21:47.800
Three, and so on, very simple.

01:21:48.600 --> 01:21:53.120
And here you could say, okay, to transform a sequence S1 into the

01:21:53.120 --> 01:21:55.280
empty string, you have to delete that symbol.

01:21:55.940 --> 01:22:00.840
To delete S1 to S2 into the empty string, you have to delete two

01:22:00.840 --> 01:22:02.080
symbols.

01:22:02.980 --> 01:22:04.440
Three symbols, and so on.

01:22:04.440 --> 01:22:07.120
So the initial entries are very simple.

01:22:09.080 --> 01:22:10.260
That's the basis.

01:22:11.060 --> 01:22:11.720
Very simple.

01:22:12.860 --> 01:22:18.240
And now you look at what kind of operations do I have, like I would

01:22:18.240 --> 01:22:22.280
like to, for example, look at the entry for that field.

01:22:22.500 --> 01:22:30.600
Or, I look at a certain position Si, and here Tj.

01:22:32.120 --> 01:22:39.980
So I would like to see what do I have to put into that entry of my

01:22:39.980 --> 01:22:40.360
table.

01:22:40.360 --> 01:22:48.600
And I assume that I already have here S, I just write down the, or D,

01:22:48.720 --> 01:23:00.920
not S, D, I, and no, this would be, sorry, this would be I-1 and J.

01:23:02.100 --> 01:23:04.840
Then I would have, this is some information I have already.

01:23:04.840 --> 01:23:11.600
Then I have D, I, and J-1, information that I also have.

01:23:11.780 --> 01:23:17.180
And here, I would have D, I-1, J-1.

01:23:17.620 --> 01:23:21.200
So I assume I have those three entries already available.

01:23:22.660 --> 01:23:27.900
And I'm interested in finding out what is the distance now for D, I,

01:23:28.160 --> 01:23:28.380
J.

01:23:28.920 --> 01:23:31.320
How can I determine that?

01:23:32.320 --> 01:23:37.240
So, there are several different things I can look at, I have all this

01:23:37.240 --> 01:23:37.820
information.

01:23:38.860 --> 01:23:43.960
So I just know it will be the minimum of several possibilities.

01:23:45.440 --> 01:23:59.160
So, I could get from, for example, S1 to Si-1 to T1 to Tj.

01:24:00.620 --> 01:24:03.740
I know that, the distance Di-1 and J.

01:24:03.820 --> 01:24:08.780
Now I have one more symbol, Si is another symbol, Si.

01:24:09.420 --> 01:24:13.580
And I would like to get the same sequence as T1 to Tj.

01:24:14.220 --> 01:24:18.960
That means, here I had, here I-1.

01:24:20.980 --> 01:24:23.100
Now, I have another symbol.

01:24:23.940 --> 01:24:28.920
I know how I can get from S1 to Si-1 to T1 to Tj.

01:24:29.540 --> 01:24:32.080
I have one more symbol, I just have to delete that symbol.

01:24:32.560 --> 01:24:37.020
One operation, in addition to the distance of I-1 and J.

01:24:37.840 --> 01:24:49.920
Now, if I look at this entry here, I know how to transform S1 to Si

01:24:49.920 --> 01:24:54.480
into T1 to Tj-1.

01:24:55.760 --> 01:24:59.940
And now I would like to transform S1 to Si into T1 to Tj.

01:25:00.160 --> 01:25:05.280
I would have to insert something to what I had generated before.

01:25:05.280 --> 01:25:08.180
And so I would need one more operation.

01:25:10.200 --> 01:25:14.640
Then, I would like to look at this situation here.

01:25:14.780 --> 01:25:20.080
I know how to transform this initial part to that initial part.

01:25:21.040 --> 01:25:23.360
And now I look at the next symbols.

01:25:23.760 --> 01:25:27.180
The next symbols would be Si and Tj.

01:25:27.760 --> 01:25:30.640
Now, if they coincide, if they are the same, I don't have to do

01:25:30.640 --> 01:25:30.980
anything.

01:25:30.980 --> 01:25:39.760
That means then I would have Di-1J-1 is exactly Dij.

01:25:40.320 --> 01:25:42.700
Or, I wouldn't have to change something there.

01:25:43.360 --> 01:25:45.240
If it is the minimum, it has to be determined.

01:25:46.360 --> 01:25:50.620
But if they are different, those two symbols, I would have to replace

01:25:50.620 --> 01:25:53.360
one with the other.

01:25:53.900 --> 01:25:55.600
So it would be a replacement operation.

01:25:55.600 --> 01:26:02.040
Or if I only have insertions and deletions, I would have to delete

01:26:02.040 --> 01:26:02.820
plus insert.

01:26:03.320 --> 01:26:04.120
And it would be two.

01:26:04.700 --> 01:26:09.940
But here, since I assume I have also replacement, it is just adding

01:26:09.940 --> 01:26:10.360
one.

01:26:10.760 --> 01:26:16.220
And then I have these possible values, and I have to determine the

01:26:16.220 --> 01:26:17.660
minimum of those values.

01:26:17.800 --> 01:26:21.440
So it's the minimum of three values that are actually possible here.

01:26:21.880 --> 01:26:22.880
This is a simple thing.

01:26:24.160 --> 01:26:31.380
And in this way, we could, for example, look at the transformation of

01:26:31.380 --> 01:26:33.140
the word Sunday into Saturday.

01:26:34.820 --> 01:26:38.640
And if I look at the time, I see that time is over.

01:26:39.400 --> 01:26:43.080
And we have to start with that the next week.

01:26:43.420 --> 01:26:48.040
And as I said already, next week, I will not be here.

01:26:48.560 --> 01:26:53.440
But I will provide you with a recorded lecture that will cover exactly

01:26:53.440 --> 01:26:55.760
the things that I would have covered next week.

01:26:56.560 --> 01:26:58.380
Okay, so see you again in two weeks.

01:26:58.700 --> 01:27:03.420
And I will provide on the web, in the website of the course, the dates

01:27:03.420 --> 01:27:06.680
where I will not be here, so that you don't come in vain to this

01:27:06.680 --> 01:27:07.480
class.

01:27:07.840 --> 01:27:12.560
And I certainly always like to see you here, because I like to see

01:27:12.560 --> 01:27:13.060
responses.

01:27:13.060 --> 01:27:17.780
And what I noticed is, I have been talking all the time, nobody of you

01:27:17.780 --> 01:27:18.440
asked questions.

01:27:18.560 --> 01:27:21.400
Maybe it was so simple that it was nothing that had to be questioned.

01:27:22.000 --> 01:27:27.200
But you should always feel definitely free to ask questions.

01:27:27.500 --> 01:27:30.820
And I'm happy to answer those questions.

01:27:31.280 --> 01:27:32.660
Okay, thank you for your attention.

