WEBVTT

00:00.160 --> 00:01.240
Good morning.

00:01.420 --> 00:04.300
Welcome to another session of Algos for Internet Applications.

00:04.480 --> 00:06.200
I'm glad that you have come to this lecture.

00:08.240 --> 00:15.300
Initially, I had told you about a few different dates, but I changed

00:15.300 --> 00:19.920
that because today I'm actually at Karlsruhe, but it looks as if not

00:19.920 --> 00:22.220
very many have actually noticed that.

00:22.220 --> 00:30.220
So let me briefly write here the dates for the lecture that we have.

00:30.620 --> 00:40.900
So, the current lecture now is the 1st of December at 9.30, and then

00:40.900 --> 00:48.220
we will have lectures the 8th of December at 8 and at 9.30.

00:48.220 --> 00:53.780
And I will not be here on the 15th, I have to go to a project event,

00:54.420 --> 00:59.900
but I will teach on the 22nd at 9.30.

01:00.100 --> 01:02.620
Next year, hopefully, every Tuesday.

01:03.540 --> 01:10.540
So these are the dates for the current date for the lecture.

01:10.820 --> 01:18.600
If something is going wrong or is going differently, then I certainly

01:18.600 --> 01:19.820
will inform you in time.

01:21.240 --> 01:31.640
Okay, so let me just briefly get back to the brief

01:34.920 --> 01:37.220
review of what we have done last time.

01:37.280 --> 01:41.140
We have looked at how we can actually search for information.

01:41.420 --> 01:45.960
I've told you a little bit, I've shown you what's available on Search

01:45.960 --> 01:46.680
Engine Watch.

01:46.680 --> 01:51.460
We looked at the classification of search engines, like that there are

01:51.460 --> 01:54.900
search engines, directories and hybrid search engines, that there are

01:54.900 --> 01:59.500
spiders which retrieve the information from the web, there is the

01:59.500 --> 02:04.480
index that we build in order to improve or to make it possible that we

02:04.480 --> 02:09.640
have efficient searches, and there is the search engine software which

02:09.640 --> 02:14.060
analyzes a query, and we will look at all these topics.

02:14.060 --> 02:18.620
We looked at the quality of information retrieval, like looked at

02:18.620 --> 02:22.920
recall and acquisition as the two measures which are relevant for

02:22.920 --> 02:29.980
actually getting some quality measure with respect to the relevant

02:29.980 --> 02:34.820
papers or relevant documents we actually get, and how precise it is,

02:34.900 --> 02:38.260
how many irrelevant documents we get.

02:38.260 --> 02:46.960
Then we briefly looked at some way of characterizing when a document

02:46.960 --> 02:48.360
is relevant for a query.

02:48.580 --> 02:49.800
We will come back to that.

02:50.780 --> 02:56.680
And we also looked at what a query actually is, and looked at some

02:56.680 --> 03:01.660
formal way of expressing a query, or I showed you some examples of

03:01.660 --> 03:05.680
search queries and gave you some information on what actually happens

03:05.680 --> 03:10.180
if we put in those research or these search queries.

03:11.800 --> 03:15.800
We looked at similarity search, how can we actually look for words

03:15.800 --> 03:19.080
which are similar to a given word, what does it mean that words are

03:19.080 --> 03:22.540
similar, we looked at Hamming distance and editing distance, looked

03:22.540 --> 03:28.020
actually then at a special type of editing distance, the so-called

03:28.020 --> 03:32.100
Lievenstein distance, which assumes that you have three operations,

03:32.320 --> 03:37.660
insert, delete and replace, and then would like to get the distance

03:37.660 --> 03:40.540
between two words in a dynamic programming approach, we looked at

03:40.540 --> 03:45.200
that, had this example where we saw how the fields are entered there,

03:45.200 --> 03:48.040
how the numbers are entered by always looking at the neighboring

03:48.040 --> 03:53.640
entries, and then, like, if you have the three neighboring entries,

03:53.740 --> 03:57.600
you can put in the value of the next field, and in that way the

03:57.600 --> 03:59.360
dynamic programming approach works.

04:01.100 --> 04:05.660
Then we looked, yeah, this was pseudocode for that, and then we looked

04:05.660 --> 04:10.860
at pattern matching, a way of actually finding a verb inside a

04:10.860 --> 04:15.200
document, and we had the naive approach, then we looked at the

04:15.200 --> 04:19.840
algorithm of Knuth, Morris and Pratt, where we have to compute the

04:19.840 --> 04:24.180
next function, where we first of all try to exploit information that

04:24.180 --> 04:27.680
we have about the structure of the pattern, and if there are

04:27.680 --> 04:33.020
reoccurring parts, like if a prefix is partially reoccurring, then we

04:33.020 --> 04:38.540
can improve the way we can actually move forward the pattern in the

04:38.540 --> 04:42.300
document if we have a mismatch, and that was done by the next

04:42.300 --> 04:45.800
function, I showed you how that is computed, first of all I showed you

04:45.800 --> 04:50.580
how Knuth, Morris, Pratt is operating, then we looked at an example, I

04:50.580 --> 04:54.120
showed you how we compute the next, which is very similar to the way

04:54.120 --> 05:00.920
we actually perform the pattern matching, and then we saw also an

05:00.920 --> 05:05.380
example, a brief example of how we compute the next function for this

05:05.380 --> 05:13.160
nice word abracadabra, and then that was the last slide of the last

05:13.160 --> 05:21.840
lecture, and so we start with this slide variant of actually finding a

05:21.840 --> 05:26.380
pattern in a text, and what Boyer and Moore did, almost simultaneously

05:26.380 --> 05:31.440
to Knuth, Morris, Pratt, was that they said they would like to improve

05:31.440 --> 05:39.180
the search for a pattern by comparing a word from right to left

05:39.180 --> 05:46.880
instead of from left to right, so if we have here such a document, and

05:46.880 --> 05:53.580
we have our pattern, and would like to check whether that pattern is

05:53.580 --> 05:58.180
actually occurring in the document, so this is the document, that's

05:58.180 --> 06:02.900
the pattern, then Knuth, Morris, Pratt would start at this position,

06:02.960 --> 06:06.320
at the left position, Boyer, Moore start at the right-hand position,

06:06.880 --> 06:11.760
and so if they are unlucky, what happens, they would compare, they

06:11.760 --> 06:14.680
would compare, they would compare, and so on, and might have a

06:14.680 --> 06:20.700
mismatch here, and then you would notice, oh, maybe you have to shift,

06:20.700 --> 06:26.780
and then you would shift just by one position, and now, if this would

06:26.780 --> 06:29.900
continue like that, it would be very bad, if you would always have

06:29.900 --> 06:34.600
just a mismatch at the last position, you would re-compare positions

06:34.600 --> 06:39.220
inside your document here several times, this would mean that you

06:39.220 --> 06:44.720
would have a bad case, where you actually have, like the worst case,

06:45.240 --> 06:51.660
would mean that you have time complexity of length of the document

06:51.660 --> 06:54.140
times length of the pattern, would be very bad.

06:54.660 --> 07:00.460
But if you are lucky, like here you get a mismatch directly at the

07:00.460 --> 07:04.420
first symbol that you actually compare, and if you are really lucky,

07:04.840 --> 07:08.480
you know that if you have a mismatch there, you can actually shift

07:08.480 --> 07:14.720
your pattern all the way to the right, and then compare again, and if

07:14.720 --> 07:17.920
you have a mismatch there, and you know something about the structure

07:17.920 --> 07:23.480
of the pattern, you could shift again to the right, and compare here,

07:23.700 --> 07:24.720
and maybe you have a match.

07:24.720 --> 07:33.320
And so in this way, you would only compare a few symbols of your

07:33.320 --> 07:38.160
document, in order to find out where the pattern actually is located.

07:38.460 --> 07:45.760
And so, in the best case, a mismatch occurs at the rightmost symbol,

07:46.280 --> 07:50.940
we can move the pattern by n positions all the way to the right, and

07:50.940 --> 07:55.180
that means in that case, we would have a complexity of n over n.

07:56.200 --> 08:00.260
Sublinear patterns, for a large pattern with the right structure, it

08:00.260 --> 08:04.200
would mean we can really have a very fast search for patterns.

08:05.720 --> 08:09.120
And so this would be a real improvement, but certainly that's the

08:09.120 --> 08:12.480
worst case, and then we would be interested in finding out what the

08:12.480 --> 08:17.860
average case is, and this is always a problem, because we know that in

08:17.860 --> 08:20.960
order to analyze the average case, you have to know something about

08:20.960 --> 08:27.640
the distribution of documents and patterns, but if you make reasonable

08:27.640 --> 08:32.020
assumptions about the distribution of documents, what kind of

08:32.020 --> 08:35.520
documents you have, what kind of patterns you could have, it shows

08:35.520 --> 08:39.840
that on the average, you can expect sublinear behavior.

08:40.220 --> 08:44.720
So this obviously is an improvement in the average case, certainly in

08:44.720 --> 08:49.720
the best case, over Knuth-Morris-Pratt, by a factor of m.

08:50.240 --> 08:55.540
If Knuth-Morris-Pratt was linear in the size of the document, here we

08:55.540 --> 08:57.620
would have sublinear performance.

08:57.980 --> 09:02.260
Now we have to look at how that actually can work, that we are

09:02.260 --> 09:10.420
actually able to move that far, or how we can determine how far we can

09:10.420 --> 09:13.720
move a pattern after we have actually seen a mismatch.

09:14.320 --> 09:17.680
So this is the situation that we have to consider.

09:18.480 --> 09:25.300
We are in our document, that's the position i here, position i in our

09:25.300 --> 09:31.160
document, we are at position j here in our pattern, and we have

09:31.160 --> 09:32.140
noticed a mismatch.

09:32.140 --> 09:39.360
That means we have all these characters here matched with that in the

09:39.360 --> 09:46.080
pattern, and then we have a mismatch, d i is not equal to w j, and

09:46.080 --> 09:48.060
that is the assumption here.

09:48.160 --> 09:52.300
And then the question is, if we have such a situation, how can we

09:52.300 --> 09:56.520
determine how far we are able to move the pattern to the right?

09:57.520 --> 10:03.740
And what I have indicated here is combining two different assumptions.

10:04.520 --> 10:12.800
One thing would be, you could say, okay, I know that this symbol i has

10:12.800 --> 10:18.900
a mismatch at this position, so you could just look for the next

10:18.900 --> 10:23.960
position of d i in the pattern, and then move the pattern to the

10:23.960 --> 10:30.600
right, such that the next time you have here d i underneath d i in the

10:30.600 --> 10:34.440
document, you know that this year, at this position i in the document,

10:34.580 --> 10:35.620
you will have a match.

10:36.280 --> 10:37.800
But what about the remainder?

10:38.420 --> 10:46.720
If you are lucky, you have the situation that the following characters

10:46.720 --> 10:51.360
in the pattern after d i actually are the same as the suffix of the

10:51.360 --> 10:51.640
pattern.

10:51.640 --> 10:52.820
That would be nice.

10:52.900 --> 10:57.760
If that is not the case, you know that there cannot be a match, so

10:57.760 --> 10:59.680
then you could move even further.

11:01.480 --> 11:05.580
And certainly you don't know what about this position here, whether

11:05.580 --> 11:13.040
you have a match there, but if you know that there is such a situation

11:13.040 --> 11:19.440
that a suffix reoccurs, and certainly if you know that the d i does

11:19.440 --> 11:23.540
occur here as the next position, you certainly know this is the

11:23.540 --> 11:25.120
minimum distance that you can shift.

11:26.260 --> 11:30.200
Because the d i only appears there, so you can shift to that position.

11:30.560 --> 11:36.060
If you are lucky, you have here a matching suffix, and then you don't

11:36.060 --> 11:40.260
know about those positions, but you have to compare again from the end

11:40.260 --> 11:41.180
of the pattern.

11:41.180 --> 11:47.540
And now what we have to decide is how to determine this shift distance

11:47.540 --> 11:52.560
that we can get by exploiting the structure of the pattern.

11:52.700 --> 11:56.520
Similar to what we had in the Knuth-Marx-Pratt algorithm, there we

11:56.520 --> 12:01.900
determined the next position, the pattern that would move underneath

12:01.900 --> 12:06.960
the position i in the document, here we determine the shift distance.

12:07.180 --> 12:08.700
So that's a slightly different approach.

12:09.820 --> 12:13.520
So there are two different heuristics.

12:14.180 --> 12:18.080
One heuristic is where does the symbol actually occur inside the

12:18.080 --> 12:18.340
pattern.

12:18.920 --> 12:22.020
This gives us information on how far to shift the pattern.

12:22.700 --> 12:29.460
The other thing will be, do we have a reoccurring suffix in the

12:29.460 --> 12:33.580
pattern, so does it occur somewhere else, does it reoccur there.

12:34.140 --> 12:37.340
That will be the match heuristic, which we will see a little bit

12:37.340 --> 12:37.600
later.

12:37.600 --> 12:42.340
So the first thing here is to look at the occurrence of symbols inside

12:42.340 --> 12:42.850
the pattern.

12:44.720 --> 12:46.540
So that's what I indicated.

12:46.800 --> 12:51.420
So for this we need some information, which provides us with the

12:51.420 --> 12:53.810
appropriate shifting distances.

12:54.480 --> 13:02.780
So first of all, we determine a table, which tells us for every

13:02.780 --> 13:08.020
character that we have in our alphabet that can occur inside a

13:08.020 --> 13:14.080
document, for every character we determine the position where, like

13:14.080 --> 13:21.140
the rightmost position of that character inside the pattern.

13:21.140 --> 13:29.580
So if a character does not occur in our word, then the distance from

13:29.580 --> 13:31.380
the right end certainly is m.

13:32.200 --> 13:33.360
It does not occur.

13:34.780 --> 13:38.880
And if it occurs in some position, so if this here, like if we look

13:38.880 --> 13:44.940
for di for example, and this here is position r, then we know that the

13:44.940 --> 13:49.980
distance from the end here, from the right, is m minus r.

13:50.780 --> 13:56.120
m minus r is the distance from the right end of the pattern.

13:57.340 --> 14:04.500
If the character is occurring at position r in the document, and the

14:04.500 --> 14:09.540
character is not occurring at the positions to the right of that r.

14:09.540 --> 14:11.280
So at positions r plus 1 to n.

14:12.620 --> 14:14.480
Okay, so this is a distance table.

14:15.260 --> 14:18.640
And then we have to determine from that, we have to determine the

14:18.640 --> 14:19.380
shift distance.

14:19.580 --> 14:21.120
How far do we have to shift?

14:21.240 --> 14:23.740
And there are two different situations for that.

14:24.720 --> 14:31.160
The first situation is, well, we only know something about our

14:31.160 --> 14:32.800
characters that occur.

14:33.520 --> 14:37.540
And so if we know that, like if we look for the occurrence of a

14:37.540 --> 14:41.800
character inside our document, we look at the occurrence of a

14:41.800 --> 14:46.980
character c, and we currently are at position j in our document.

14:47.080 --> 14:48.100
This is position j.

14:50.120 --> 14:56.300
And now we know a certain character occurs, well, this is, if m minus

14:56.300 --> 14:59.660
j is greater or equal to delta of c.

14:59.660 --> 15:03.260
m minus j is this distance here.

15:04.100 --> 15:07.680
And if this is greater than the distance of the character from the

15:07.680 --> 15:10.940
rightmost end, it means that the character occurs somewhere over

15:10.940 --> 15:11.240
there.

15:13.820 --> 15:17.700
And if it occurs to the right of that position, then we can't do

15:17.700 --> 15:20.840
anything but just move it to the right.

15:23.300 --> 15:28.840
And so the next character here, like if you move it to the right, the

15:28.840 --> 15:33.800
next character here could be that symbol, that character c.

15:34.860 --> 15:37.480
So that is the maximum we can shift it.

15:39.320 --> 15:43.700
And we know that we have to shift because we know that the character

15:43.700 --> 15:49.960
does not occur to the right of that, but we have to shift exactly that

15:49.960 --> 15:50.380
distance.

15:50.380 --> 15:56.140
So we have to shift such that this character here is moved to the

15:56.140 --> 16:03.560
position right next, to the right of the right end of our pattern.

16:04.220 --> 16:08.580
And that means we have to shift by delta of c, which was the distance

16:08.580 --> 16:16.220
of that character from the right end, plus 1, in order to move it one

16:16.220 --> 16:19.560
position to the right, to move it to that position.

16:19.560 --> 16:22.100
This character has to be moved over there.

16:22.380 --> 16:23.520
That's delta c plus 1.

16:23.840 --> 16:31.360
The other case is that this character that we are interested in occurs

16:31.360 --> 16:35.920
to the left of our current position in the pattern.

16:37.020 --> 16:41.760
Now if that is the case, we know that this symbol is not occurring

16:41.760 --> 16:44.120
somewhere to the right of it, but just here.

16:45.080 --> 16:52.200
And since this might be looking for the occurrence of character di,

16:52.980 --> 16:57.680
which may be such a c, then we have to move exactly this distance

16:57.680 --> 16:58.060
here.

16:58.420 --> 16:59.600
What is this distance?

17:00.000 --> 17:02.640
This is delta of c.

17:03.660 --> 17:06.880
That's the complete... that is this distance.

17:06.960 --> 17:09.360
That's delta of c there.

17:09.360 --> 17:15.320
And then we have to subtract this part here, which is m minus j.

17:18.040 --> 17:23.420
And that means we shift that symbol such that it moved underneath the

17:23.420 --> 17:25.260
current position in the document.

17:26.060 --> 17:28.460
That is delta c minus m minus j.

17:29.840 --> 17:34.300
So it is... that's the case where c is to the left of position j.

17:34.980 --> 17:39.440
Those two situations we have to distinguish, and then we have some

17:39.440 --> 17:41.680
information on how we can shift the pattern.

17:42.780 --> 17:47.200
Now, let's... and then certainly one thing is essential.

17:48.180 --> 17:54.000
For most symbols of the alphabet, we have delta of c will be m,

17:54.180 --> 17:59.480
because a word that we look for will not contain all the 26 or even

17:59.480 --> 18:01.900
more letters of our alphabet.

18:01.900 --> 18:06.860
If we have the Latin alphabet, we have 26 symbols, plus a few others.

18:07.600 --> 18:13.160
So most of them will not occur, and so for most of the symbols we will

18:13.160 --> 18:14.560
have large shift distances.

18:15.500 --> 18:20.920
And if we have a mismatch at the rightmost position, a large shift

18:20.920 --> 18:23.900
distance means that then...

18:23.900 --> 18:33.160
or a large delta... if delta c is large, it means that this value will

18:33.160 --> 18:33.660
be large.

18:33.800 --> 18:36.760
We will have... this will be... it is this situation.

18:37.060 --> 18:41.440
Then we can move actually by m positions.

18:43.660 --> 18:48.160
So this is... or even m plus 1 positions in this case.

18:48.460 --> 18:53.040
So this is really leading to nice situations.

18:53.040 --> 18:58.340
If we have such a situation just at the leftmost position, like if we

18:58.340 --> 18:59.240
notice...

18:59.240 --> 19:04.040
if there we have a mismatch, and we look for a character which is not

19:04.040 --> 19:09.580
occurring in our pattern, then we can only move it by one position,

19:09.680 --> 19:17.640
because then delta of c is... would be m.

19:17.640 --> 19:22.820
Then we subtract m, and we add the position j, and the position j in

19:22.820 --> 19:24.060
this case would be just 1.

19:24.880 --> 19:35.200
So in that case we would have just 1... that then we could just shift

19:35.200 --> 19:36.200
by one position.

19:37.560 --> 19:42.260
Okay, but if we are at the rightmost end, we can shift actually by m

19:42.260 --> 19:42.640
positions.

19:43.840 --> 19:49.220
Okay, we will look at examples, but we can also look at how you can

19:49.220 --> 19:49.980
improve that.

19:50.780 --> 19:57.640
You could not just look at... like compute delta c not just to the...

19:58.200 --> 20:02.400
or we could look at c with respect to the occurrence from the

20:02.400 --> 20:07.560
rightmost end, and like we look for the rightmost occurrence inside

20:07.560 --> 20:08.120
the pattern.

20:08.120 --> 20:12.620
We could also look at the... like if this is our pattern, I said we

20:12.620 --> 20:20.780
are at position j, and now we look at the occurrence of w... of that

20:20.780 --> 20:27.840
character c, and so we could compute the shift distance with respect

20:27.840 --> 20:34.280
to this current position j here, and then we would look at where the

20:34.280 --> 20:39.500
symbol that we look for, the symbol in our document which was our di,

20:40.420 --> 20:44.080
where it occurs to the left of that position.

20:45.260 --> 20:50.500
Yeah, so there we would look for the rightmost occurrence in the

20:50.500 --> 20:58.540
pattern left of position j, and then this would lead to, again, to a

20:58.540 --> 21:00.460
different table, which might be better.

21:01.100 --> 21:04.520
And another thing what we could do...

21:07.800 --> 21:16.880
we could... if we have... if we are... like if we are at a position j,

21:17.120 --> 21:21.860
and this is less than the position m, like this is position m, if we

21:21.860 --> 21:26.080
are to the left of that, we had a match at this position here.

21:26.080 --> 21:30.640
So we know that this symbol occurred already in our document.

21:30.840 --> 21:34.660
So if we shift to the right, we could look for the next occurrence of

21:34.660 --> 21:37.520
this symbol in our pattern.

21:37.980 --> 21:42.800
Like if this symbol which is there at position... like this wm, if we

21:42.800 --> 21:47.000
look for the next position where that c occurs inside our pattern,

21:47.400 --> 21:53.860
then we know we should move that symbol underneath this location here.

21:53.860 --> 21:58.920
So you see that there can be many different heuristics how to move

21:58.920 --> 22:03.020
that pattern if you have a mismatch, and you try to find out the

22:03.020 --> 22:07.320
maximum shifting distance at this situation.

22:08.600 --> 22:14.060
Okay, so not look at what kind of symbol you had here inside in your

22:14.060 --> 22:18.200
document, but what about the reoccurrence of the symbol you just

22:18.200 --> 22:22.320
looked at, or where you know it occurred in the document, and where

22:22.320 --> 22:23.260
that reoccurs.

22:23.260 --> 22:27.800
And then you can... you would get different shifting distances, and

22:27.800 --> 22:31.680
you could go for the maximum shifting distance that you can get using

22:31.680 --> 22:32.760
these different heuristics.

22:35.500 --> 22:38.500
In a moment we will have, like on the next slide, we will have

22:38.500 --> 22:39.440
examples for that.

22:39.840 --> 22:44.720
I thought we would have an example here already, but somehow it's not

22:44.720 --> 22:45.020
here.

22:45.240 --> 22:49.220
Okay, this is... did not miss a slide, no.

22:49.340 --> 22:50.940
Then on the next slide we'll have examples.

22:52.340 --> 22:57.040
The match heuristic, just to complete it, the match heuristic now

22:57.040 --> 23:03.740
addresses this other idea that we would look for a reoccurring prefix,

23:04.140 --> 23:05.160
a reoccurring suffix.

23:05.920 --> 23:13.740
So if we are at position J in our document, and we know that we had a

23:13.740 --> 23:20.420
mismatch here at position J, we certainly look for a different symbol

23:20.420 --> 23:26.720
in our pattern, hopefully exactly the symbol that we saw in the

23:26.720 --> 23:31.240
document, but if you just analyze the pattern, you don't know which

23:31.240 --> 23:32.200
symbol that would be.

23:32.340 --> 23:37.680
You look for a different symbol than WJ, and you would like to have

23:37.680 --> 23:43.980
that the suffix that is to the right of J, where we know that this

23:43.980 --> 23:49.700
must have matched in the document, this should reoccur in our pattern

23:49.700 --> 23:55.220
to the right of position J minus S, and certainly the position here,

23:55.820 --> 23:58.480
or the character at that position, should be different from the

23:58.480 --> 24:02.600
character at that position J.

24:03.220 --> 24:13.160
So we should have WJ minus S is different from WJ, and for all the

24:13.160 --> 24:21.020
positions to the right, we should have that WK minus S is equal to WK.

24:21.440 --> 24:25.240
That means those parts here coincide.

24:25.720 --> 24:27.580
The suffix is reoccurring.

24:30.080 --> 24:37.480
Because then what we look for is for the rightmost reoccurrence of

24:37.480 --> 24:45.800
that suffix, such that the symbol to the left of that suffix is

24:45.800 --> 24:48.940
different from WJ.

24:50.280 --> 24:56.460
And now, certainly it may be that this situation does not occur at

24:56.460 --> 24:56.680
all.

24:56.680 --> 24:59.500
That means that there is no reoccurring suffix.

25:00.040 --> 25:01.720
What do we do in that situation?

25:02.820 --> 25:09.640
So in that case, we might have a situation where S actually is greater

25:09.640 --> 25:10.540
equal than J.

25:11.260 --> 25:14.300
We shift by more than J positions.

25:15.400 --> 25:20.560
If we shift by more than J positions, what does that mean?

25:20.560 --> 25:25.740
Like, if we shift by J positions, it would mean our pattern would

25:25.740 --> 25:26.880
reoccur here.

25:27.200 --> 25:30.300
Shifting by J positions would mean it reoccurs there.

25:31.380 --> 25:36.500
But if it reoccurs there, if you move it there, certainly then we

25:36.500 --> 25:41.440
would require that this part here, the prefix, should be exactly the

25:41.440 --> 25:41.880
suffix.

25:44.380 --> 25:48.320
But if this is not the case, we could shift even further.

25:48.320 --> 25:52.380
If there are differences, then we could shift further.

25:52.460 --> 25:54.540
In the best case, we would go to that position.

25:55.340 --> 26:00.040
That means S would be greater than J in that case.

26:03.220 --> 26:11.100
And for all K to the right of position J, we either have that the

26:11.100 --> 26:15.100
positions coincide or S is greater equal than K.

26:16.160 --> 26:21.080
That means we shift to some intermediate position.

26:22.320 --> 26:29.680
And for all those here, it does not reoccur.

26:30.120 --> 26:35.140
But from that position here, we have a reoccurring suffix, which

26:35.140 --> 26:39.040
reoccurs as a prefix in our pattern.

26:40.040 --> 26:46.720
And so the suffix, if the suffix reoccurs as a prefix, then that's our

26:46.720 --> 26:49.860
limit how far we can shift actually our pattern.

26:50.520 --> 26:55.720
And this is expressed exactly by this definition here.

26:56.360 --> 27:06.280
For the rightmost position, the minimum shift distance, such that S is

27:06.280 --> 27:15.860
greater equal to 1, and S is less than J, or WJ minus S is not equal

27:15.860 --> 27:22.980
to WJ, different symbols here, and this requirement that we have a

27:22.980 --> 27:29.020
reoccurring suffix, at least partially.

27:30.040 --> 27:33.820
Yeah, that is exactly that condition there.

27:34.700 --> 27:36.540
And here's an example.

27:36.980 --> 27:41.620
If we have a pattern, so we certainly look for words where we have

27:41.620 --> 27:47.320
reoccurring patterns inside the pattern, so obviously the A is

27:47.320 --> 27:52.280
occurring several times, NA is occurring several times, but if we now

27:52.280 --> 27:59.820
look at the distances, like if we look at reoccurring suffixes, if we

27:59.820 --> 28:06.420
assume we are at the rightmost position, so at position M,

28:09.500 --> 28:16.380
or in this case, position 6, what about a reoccurring suffix?

28:16.860 --> 28:22.560
The suffix in that case is the empty word.

28:23.580 --> 28:28.120
And the next position to the left of this position J, where this

28:28.120 --> 28:30.320
suffix reoccurs, is position 5.

28:31.620 --> 28:35.440
Because there is an N, and N is different from A.

28:37.680 --> 28:42.780
And so here we have a reoccurring, or we have a different symbol, the

28:42.780 --> 28:47.380
suffix is the same, it is the empty word, and so we can only shift by

28:47.380 --> 28:48.200
one position.

28:49.460 --> 28:55.400
If we would have a pattern, like assume we have a pattern, and there

28:55.400 --> 29:00.740
you would have AA in the end, and you would have a, let's say here a

29:00.740 --> 29:08.180
B, and you would have a mismatch at this position, there you would say

29:08.180 --> 29:15.360
that the suffix is the empty word, and the next position where the

29:15.360 --> 29:19.840
suffix occurs, with a different symbol to the left, is this here.

29:20.920 --> 29:22.460
Then you would have shift distance 2.

29:23.800 --> 29:24.360
Yeah?

29:24.980 --> 29:26.980
Just to show you how we compute that.

29:27.380 --> 29:29.120
Now let's look at the next situation.

29:29.600 --> 29:32.600
We are at position 2, no, position 5 here.

29:33.820 --> 29:34.780
Position 5.

29:34.920 --> 29:37.000
The suffix in this case is A.

29:38.380 --> 29:43.520
We have reoccurring suffix, or we have this reoccurring suffix A, at

29:43.520 --> 29:48.960
that position, at that position, but the rightmost position where it

29:48.960 --> 29:54.560
reoccurs, such that the left is different, is B here.

29:55.200 --> 29:59.440
So we have to shift, or we can shift all the way, such that the B is

29:59.440 --> 30:00.600
underneath this N.

30:01.940 --> 30:04.880
That means we have a shift distance, as you can see here, by 4.

30:07.180 --> 30:09.160
Now we look at the next situation.

30:09.960 --> 30:16.280
We have, we are at position 4, and the suffix is N-A.

30:17.560 --> 30:19.400
Now N-A does reoccur.

30:20.220 --> 30:22.040
N-A reoccurs here.

30:23.240 --> 30:24.460
There's a reoccurrence.

30:25.060 --> 30:27.540
But again, we have an A to the left of it.

30:28.600 --> 30:37.880
And then it does not reoccur, and actually, there's no matching, like,

30:37.880 --> 30:43.940
there's no suffix here that reoccurs as a prefix, and so we have the

30:43.940 --> 30:47.600
maximum shift distance of 6, in this case.

30:48.440 --> 30:51.980
Because we have to move the pattern all the way to the right of the

30:51.980 --> 30:54.300
original, or to the current position.

30:55.560 --> 30:57.440
There we have a shift distance of 6.

30:58.340 --> 31:05.340
Now if we are at position 3, the suffix is A-N-A.

31:06.520 --> 31:08.740
A-N-A reoccurs here.

31:09.600 --> 31:10.320
A-N-A.

31:10.840 --> 31:15.140
And B is to the left of it, which is different from what we had there.

31:15.220 --> 31:15.960
There we had an N.

31:16.620 --> 31:18.280
The shift distance is just 2.

31:20.880 --> 31:24.260
And so we have the entry of 2 there.

31:24.620 --> 31:29.300
And then if we go to A or to B, the suffix does not reoccur.

31:29.300 --> 31:33.980
There's no matching suffix of that suffix that reoccurs as a prefix,

31:34.600 --> 31:38.860
so we have the maximum shift distance for those two locations here,

31:39.320 --> 31:40.940
for positions 2 and 1.

31:41.820 --> 31:47.300
So these are the shift distances that we get by looking at a

31:47.300 --> 31:51.560
reoccurring suffix, such that to the left of it, there's a different

31:51.560 --> 31:51.980
symbol.

31:53.000 --> 31:58.720
And so this shows you that we get quite large shift distances

31:59.580 --> 32:02.620
depending on the structure of the pattern.

32:03.920 --> 32:11.000
And it's obvious that the match heuristic will shift the pattern the

32:11.000 --> 32:16.560
more to the right, the further to the left the suffix reoccurs.

32:18.300 --> 32:22.980
Because if it reoccurs at a position very far to the left of the

32:22.980 --> 32:25.740
current position, we can shift that far.

32:27.040 --> 32:30.940
And if it does not occur at all, we can shift all the way.

32:31.600 --> 32:32.640
We have maximum shift distance.

32:32.720 --> 32:40.080
Maximum shift distance means we get really an improvement with respect

32:40.080 --> 32:44.520
to the number of comparisons we have to make to find a pattern in the

32:44.520 --> 32:44.820
text.

32:46.180 --> 32:53.920
So this heuristic will be most effective for completely aperiodic

32:53.920 --> 32:58.560
words where there's no suffix that reoccurs at a prefix.

33:00.460 --> 33:07.200
And so if, yeah, obviously this is, as we have seen here in this

33:07.200 --> 33:12.460
example, if the suffix does not reoccur in the beginning, you get a

33:12.460 --> 33:13.920
large shift distance.

33:14.820 --> 33:18.280
And then you can combine those two heuristics.

33:19.000 --> 33:20.920
We'll get more examples in a moment.

33:21.360 --> 33:22.640
You can combine the heuristics.

33:22.740 --> 33:29.760
You just look at the occurrence heuristic and the match heuristic and

33:29.760 --> 33:33.140
just shift by the maximum of those two informations.

33:33.980 --> 33:37.560
And this essentially is the original algorithm by Boyer-Moore.

33:37.900 --> 33:41.340
If you look into the literature, you will find quite a few variants.

33:41.340 --> 33:47.260
I mentioned already that you can use different ways of defining the

33:47.260 --> 33:48.220
occurrence heuristic.

33:49.140 --> 33:53.160
And so there are some variants around.

33:53.760 --> 33:57.120
If you're interested, look at it in the literature and you'll find

33:57.120 --> 34:02.640
that the major idea is to compare starting from the rightmost end,

34:02.900 --> 34:08.460
from the rightmost symbol, and the other idea is to exploit

34:08.460 --> 34:13.460
information on the occurrence of characters and about the reoccurrence

34:13.460 --> 34:15.900
of suffixes inside the pattern.

34:17.760 --> 34:23.160
Strangely enough, in one standard book on information retrieval, they

34:23.160 --> 34:27.780
stated, like Frakes and Peiser-Yates stated, that the match heuristic

34:27.780 --> 34:31.060
would be not effective since periodic words are rare,

34:35.280 --> 34:42.360
but if periodic words are rare, we have, for many words which are

34:42.360 --> 34:46.240
aperiodic, we have maximum shift distances, as we have just seen.

34:46.760 --> 34:50.820
If there's no reoccurrence of the suffix, so it's not periodic in some

34:50.820 --> 34:53.300
way, we get the maximum shift distance.

34:53.740 --> 34:56.340
And so for those situations, it's very effective.

34:57.240 --> 35:03.580
If you have a very periodic word, then it is not leading to large

35:03.580 --> 35:04.320
shift distances.

35:04.740 --> 35:08.780
But if you have an aperiodic word, it is very effective.

35:10.120 --> 35:13.620
So for aperiodic words, sigma H has a maximum value.

35:14.620 --> 35:25.100
And we can briefly look at one tool that my assistant Kaibin Bao

35:25.100 --> 35:26.460
actually designed.

35:28.300 --> 35:33.020
A tool for getting those tables.

35:35.400 --> 35:36.840
Looks a bit strange.

35:39.340 --> 35:39.940
So...

35:41.860 --> 35:48.360
Now I have to do this.

35:48.820 --> 35:49.860
What can I do here?

35:50.920 --> 35:54.220
Like, the resolution is different from what I had before.

35:54.960 --> 35:56.560
I would like to...

35:56.560 --> 35:58.600
Ah, now it looks nicer.

35:59.180 --> 35:59.300
Okay.

35:59.300 --> 36:01.420
So this is...

36:01.420 --> 36:05.920
Here you have one pattern.

36:06.400 --> 36:07.080
A, I, A.

36:07.900 --> 36:08.880
A, I, A.

36:09.580 --> 36:11.060
Algorithms for Internet applications.

36:13.780 --> 36:14.680
There's a shift table.

36:14.820 --> 36:17.300
And here you see a situation where W...

36:18.000 --> 36:20.520
Like, if you are at a certain...

36:20.520 --> 36:23.040
Like, here different situations are displayed.

36:23.560 --> 36:27.360
Let's just briefly write here, for example, a different word.

36:29.300 --> 36:32.440
You see immediately you get here the distance tables.

36:33.840 --> 36:34.280
Abra...

36:34.980 --> 36:35.420
Ca...

36:36.900 --> 36:37.340
Da...

36:38.600 --> 36:39.040
Bra...

36:39.040 --> 36:41.780
Oops, even a bit of a space in between.

36:42.500 --> 36:45.020
And you see computation of the distance tables.

36:45.080 --> 36:46.160
You can play around with that.

36:46.220 --> 36:50.140
I don't want to go into the details of that simulation.

36:51.120 --> 36:52.840
One point is a bit strange.

36:53.180 --> 36:58.160
That it modifies here the illustration in some way.

36:58.400 --> 37:03.600
Which I have to ask Kevin Bao how you actually interfere with this.

37:05.320 --> 37:12.700
I have not found out how you can actually modify this in a controlled

37:12.700 --> 37:13.020
way.

37:13.020 --> 37:18.320
But you actually have here the...

37:19.720 --> 37:24.420
You have the word and you have the shift distance here, the sigma j.

37:25.320 --> 37:27.080
Okay, that's the match heuristic.

37:27.280 --> 37:29.580
So this is giving you the match heuristic for that.

37:30.840 --> 37:32.800
And if we delete the...

37:32.800 --> 37:35.100
If we change that...

37:35.100 --> 37:39.020
Delete the a here, then you see this is...

37:40.160 --> 37:44.620
The match heuristic shifting distance for abracadabra.

37:45.280 --> 37:49.260
Just use any word that you like and you see what the shift distances

37:49.260 --> 37:50.240
actually will be.

37:51.420 --> 37:55.220
Okay, just to show you that there's a tool where you can play around a

37:55.220 --> 37:55.680
little bit.

37:55.920 --> 37:59.420
And get some information on how those distances are computed.

38:00.160 --> 38:03.780
And since this is one of the...

38:03.780 --> 38:06.140
Like, in some way which is reasonable.

38:07.220 --> 38:13.220
You may run into a situation where you are asked to actually compute a

38:13.220 --> 38:15.060
match heuristic table for a pattern.

38:15.780 --> 38:18.960
And so if you have this tool available you can easily check whether

38:18.960 --> 38:23.100
what you compute actually is the correct table.

38:24.040 --> 38:25.380
And for that it is a nice tool.

38:26.700 --> 38:26.900
Okay.

38:28.480 --> 38:29.520
Pre-processing time.

38:29.860 --> 38:32.960
Like, I give you more examples in a moment.

38:32.960 --> 38:34.080
Pre-processing time.

38:34.580 --> 38:39.180
Like, how much time do you have to invest to actually find those

38:39.180 --> 38:39.660
tables?

38:40.160 --> 38:41.700
You have to analyze the pattern.

38:41.900 --> 38:42.080
Oops.

38:42.340 --> 38:43.560
You have to analyze the pattern.

38:44.520 --> 38:48.000
And this is order of m.

38:48.900 --> 38:51.100
And you have to do that.

38:52.360 --> 38:57.500
Like, to look into the alphabet table.

38:58.080 --> 38:59.820
The size of the alphabet is a constant.

38:59.820 --> 39:06.720
So it is just linear in the size of the pattern.

39:07.460 --> 39:09.480
So because you just have to look at all the different...

39:09.480 --> 39:14.080
Like, for every position you have to look up...

39:14.840 --> 39:21.700
Like, to look into the table where actually a pattern or a symbol

39:21.700 --> 39:22.200
occurs.

39:22.620 --> 39:25.520
Or where those suffixes reoccur.

39:25.520 --> 39:32.320
And the effort for that is just linear in the size of the pattern.

39:33.220 --> 39:35.380
As I said, there are several improved versions.

39:35.620 --> 39:42.160
You can actually improve it such that the worst case is not that bad.

39:43.240 --> 39:50.200
And you could also look at a related problem where you don't look for

39:50.200 --> 39:53.180
just one pattern, but for several patterns.

39:53.180 --> 39:56.840
So assume you have several patterns that you look at.

39:58.640 --> 40:00.280
And you could do the following.

40:00.440 --> 40:02.760
That you just apply k times the algorithm.

40:04.980 --> 40:11.620
But you could also maybe notice that they might have a matching

40:11.620 --> 40:12.120
prefix.

40:13.140 --> 40:15.780
Like, some have a larger matching prefix.

40:16.320 --> 40:18.360
Some just a smaller matching prefix.

40:18.360 --> 40:23.880
And then you can combine the search for those several patterns.

40:24.500 --> 40:31.100
And you build some kind of a tree where you look for one symbol and

40:31.100 --> 40:33.460
then you have to split in some way.

40:33.960 --> 40:36.400
And maybe you have to split later again to look for the other

40:36.400 --> 40:36.800
patterns.

40:37.200 --> 40:42.100
But you can combine the search for those patterns into one algorithm.

40:42.840 --> 40:52.400
And so if you search concurrently, look at shift distances related to

40:52.400 --> 40:56.100
these different patterns simultaneously, then you get an improved

40:56.100 --> 40:56.740
algorithm.

40:58.080 --> 41:02.460
And it takes the same time as looking for a single pattern.

41:03.160 --> 41:04.820
Almost the same time.

41:05.580 --> 41:12.060
And it's definitely not k times the effort that you have to do for the

41:12.060 --> 41:12.860
individual patterns.

41:13.900 --> 41:17.660
And there are algorithms which actually do that by Ejo Kurasik.

41:18.400 --> 41:20.800
He based it on Fluth-Norris-Pratt.

41:21.200 --> 41:25.300
And Kommenz-Walter who did it also for Boyer-Muhr.

41:26.060 --> 41:30.260
So just a combination of several approaches to look for several

41:30.260 --> 41:31.760
patterns simultaneously.

41:32.620 --> 41:36.320
Which you need, just to look back at the situation which was our

41:36.320 --> 41:47.020
motivation, you have a query and this query may contain a number of

41:47.020 --> 41:52.460
terms that you would like to look at simultaneously.

41:52.460 --> 41:59.960
And so this is a reasonable situation to look for several patterns in

41:59.960 --> 42:00.460
a document.

42:00.640 --> 42:05.380
And then you can do that by using those algorithms that I indicated

42:05.380 --> 42:05.740
here.

42:05.820 --> 42:08.420
But I don't go into the details of those two algorithms.

42:08.920 --> 42:11.880
They are essentially using the same ideas.

42:12.360 --> 42:18.480
In addition, they have nice ways of combining the search for different

42:18.480 --> 42:18.940
patterns.

42:18.940 --> 42:25.300
Now let's look back to an example of Boyer-Muhr again.

42:26.440 --> 42:32.440
The example for Boyer-Muhr in this case is just looking at the word

42:32.440 --> 42:33.120
example.

42:33.900 --> 42:41.680
To give you more feeling for what actually is happening here.

42:42.300 --> 42:44.000
So if we look at the occurrence heuristic.

42:44.000 --> 42:59.960
The occurrence heuristic was looking for every position of our pattern

42:59.960 --> 43:04.820
for the shift distance that we could actually have there.

43:04.900 --> 43:07.580
So if you look at the example, if you are at position 1.

43:08.120 --> 43:10.780
Position 1 is this position here.

43:10.780 --> 43:11.900
The E.

43:13.420 --> 43:16.640
Now you look for the symbol A.

43:17.400 --> 43:21.520
The symbol A occurs at position A.

43:21.800 --> 43:25.720
Here is that position.

43:26.840 --> 43:31.340
And it occurs to the right of our current position in the document.

43:32.300 --> 43:34.420
Sorry, our current position in the pattern.

43:34.420 --> 43:42.440
And so we have to move that A to the word example such that the A is

43:42.440 --> 43:43.600
moved to that position.

43:44.360 --> 43:46.620
That's exactly 5 positions.

43:48.080 --> 43:53.240
Now if we are at position 1 and we would never look for an E.

43:54.020 --> 43:57.220
Because this would not be a mismatch.

43:57.700 --> 43:59.480
So we don't need an entry for that.

44:01.340 --> 44:04.840
If we are at position 1 and we look for an L.

44:05.580 --> 44:07.680
L occurs at this position.

44:08.480 --> 44:13.300
So the L again would have to be moved by 2 positions.

44:13.840 --> 44:18.820
If we look for an M, we have to move by 4 positions.

44:19.060 --> 44:21.620
If we look for a P, by 3 positions.

44:21.620 --> 44:29.640
If we have an X, which is there, we have to move by 6 positions to the

44:29.640 --> 44:29.940
right.

44:30.500 --> 44:35.120
And if we have any other symbol, which the asterisk here is supposed

44:35.120 --> 44:35.840
to indicate.

44:37.020 --> 44:42.540
If we are at the first position in the document and we have a symbol

44:42.540 --> 44:47.400
which does not occur inside this pattern.

44:47.400 --> 44:53.120
Then we can only shift by one position.

44:54.740 --> 44:56.600
This is what I indicated to you.

44:57.680 --> 45:00.960
Then the delta would be M.

45:01.600 --> 45:07.520
But we are at the first position and we would compute M minus M minus

45:07.520 --> 45:07.820
1.

45:08.060 --> 45:11.460
So the result is one position that we can shift.

45:12.200 --> 45:16.120
And in this way we compute the other entries in this table.

45:16.820 --> 45:19.480
For example, here if we are at position 4.

45:19.680 --> 45:21.220
1, 2, 3, 4.

45:21.680 --> 45:24.200
Then we would be there at the M.

45:25.420 --> 45:28.760
As you can also see here, that's the position we currently are.

45:31.340 --> 45:37.820
If we look for an A, this is occurring to the left.

45:38.420 --> 45:40.060
Just shift by one position.

45:40.220 --> 45:42.560
If we have an E, it's occurring there.

45:42.660 --> 45:43.800
It's to the right of M.

45:44.400 --> 45:46.620
But we can only shift by one position.

45:47.120 --> 45:50.800
If we look for an L, also to the right, move by two positions.

45:51.460 --> 45:56.100
If we have a P, then we move by three positions.

45:56.240 --> 45:57.920
It's to the right of the current position.

45:58.360 --> 46:01.540
If we have an X, this is to the left of the current position.

46:01.700 --> 46:02.880
We move by two positions.

46:03.260 --> 46:05.200
We would end up in exactly this position.

46:05.920 --> 46:11.380
And if we have some arbitrary other symbol, we can move by two

46:11.380 --> 46:11.780
positions.

46:13.160 --> 46:13.880
Okay.

46:14.820 --> 46:17.320
And in this way, the remaining entries are there.

46:17.700 --> 46:28.040
If you look at the match heuristic for the last symbol, the symbol at

46:28.040 --> 46:35.000
position 7, again, since the symbol at position 6 is different, we can

46:35.000 --> 46:36.280
only shift by one position.

46:37.040 --> 46:44.380
And for all the other remaining elements, we can shift by six, not by

46:44.380 --> 46:44.840
seven.

46:45.300 --> 46:50.400
But we can shift only by six positions, because this suffix here, the

46:50.400 --> 46:52.500
E, is reoccurring as a prefix.

46:54.120 --> 46:58.680
So this is the largest suffix which reoccurs as a prefix.

46:59.600 --> 47:04.420
And so, since we have that, we can only shift by six.

47:04.420 --> 47:10.060
And this is the same for all the other positions, since there is no

47:10.060 --> 47:12.300
reoccurring suffix.

47:12.660 --> 47:19.720
So, like L-E, P-L-E, and so on, they all do not reoccur, except for

47:19.720 --> 47:22.280
that last symbol, E, which reoccurs at the beginning.

47:22.740 --> 47:25.840
So we have a shift distance of six for all of these other locations.

47:29.680 --> 47:38.060
Now, if we use that in the comparison, how the heuristics actually

47:38.060 --> 47:43.600
operate, we had just computed the shift and match heuristic shift

47:43.600 --> 47:47.640
distance, or this occurrence and match heuristic shift distances, and

47:47.640 --> 47:50.740
so we look for the occurrence of the word, or the pattern, for

47:50.740 --> 47:53.800
example, inside a simple document.

47:53.800 --> 47:58.240
The document is, this is a simple example.

47:59.520 --> 48:04.620
And what we have to do first is, like, we compare at this location.

48:04.820 --> 48:06.120
We compare S and E.

48:06.260 --> 48:07.100
We have a mismatch.

48:08.100 --> 48:10.500
One mismatch for comparison, one mismatch.

48:11.200 --> 48:12.500
We have different heuristics.

48:12.720 --> 48:16.360
The occurrence and match heuristics give us different entries.

48:17.440 --> 48:20.700
The S does not reoccur in our pattern.

48:21.630 --> 48:27.100
That means we have a shift distance of seven from the occurrence.

48:29.450 --> 48:35.580
The match heuristic just told us, well, the E is reoccurring as a

48:35.580 --> 48:39.960
prefix, so we can only move by one position.

48:41.530 --> 48:46.020
Which is not really what we would like to have.

48:47.780 --> 48:51.980
No, this match heuristic, sorry, this was different.

48:52.260 --> 48:54.500
This was wrong, what I just said.

48:54.980 --> 48:58.680
The reason why we have a one here is that we are at the rightmost

48:58.680 --> 48:59.160
position.

48:59.800 --> 49:02.460
The suffix that we have to consider here is the empty word.

49:03.320 --> 49:09.720
And since the left of that is L, which is a different symbol, the

49:09.720 --> 49:13.500
suffix, and so we can only move it by one position.

49:14.200 --> 49:14.880
Okay.

49:15.980 --> 49:20.040
The maximum of those two, of those is seven, so we shift by seven

49:20.040 --> 49:20.480
positions.

49:21.800 --> 49:23.200
This is the next situation.

49:23.840 --> 49:28.360
We have shifted by seven positions, and then we compare those two

49:28.360 --> 49:33.280
symbols, E and P. So again, we have one combination.

49:34.540 --> 49:44.400
The occurrence heuristic tells us the E, no, sorry, the P is

49:44.400 --> 49:48.560
reoccurring there, or is occurring, that's the rightmost occurrence of

49:48.560 --> 49:53.140
P inside our pattern, and so we can shift by two positions.

49:54.440 --> 49:56.800
Yeah, that was the occurrence heuristic.

49:57.240 --> 50:01.920
The match heuristic, again, only lets us shift by one position.

50:02.100 --> 50:05.220
It's the same as in the row above that, only one.

50:05.560 --> 50:12.880
Then, we have shifted by two positions, means that we compare here,

50:13.240 --> 50:17.660
those are, they are all matching, and there we have a mismatch now,

50:18.620 --> 50:19.720
this, at this position.

50:20.560 --> 50:26.800
Now we look for the situation, so we have made five comparisons, to

50:26.800 --> 50:27.840
get to this location.

50:29.100 --> 50:31.540
The occurrence heuristic is looking for an I.

50:32.620 --> 50:38.260
The I is not inside our word, but we are at position three, so we can

50:38.260 --> 50:41.960
only move by three positions, depending on the occurrence heuristic.

50:43.980 --> 50:50.620
The match heuristic told us, well, MPLE does not occur, and only the

50:50.620 --> 50:57.540
final suffix here, shifting distance of six, and three, so we move by

50:57.540 --> 51:02.760
six positions, then we compare here again, that's, we have seen that

51:02.760 --> 51:06.480
situation before, where we get two and one as the two entries for

51:06.480 --> 51:12.060
match heuristic, occurrence heuristic and match heuristic, and then we

51:12.060 --> 51:17.340
shift by two positions, and finally have to compare all the symbols

51:17.340 --> 51:26.860
again, and then we have seven comparisons.

51:28.820 --> 51:34.720
Although you certainly could also say, since I, no, this is not

51:34.720 --> 51:39.660
correct, I just wanted to say that, you know about the reoccurrence of

51:39.660 --> 51:44.240
certain symbols from your document inside your pattern, because you

51:44.240 --> 51:52.400
have used that information already, but you have to look actually at

51:52.400 --> 51:55.880
all these symbols starting from the rightmost end, and go all the way

51:55.880 --> 51:58.680
to the left, and find out that they all...

51:58.680 --> 52:00.780
there's nothing you can save really.

52:02.060 --> 52:13.940
Okay, so this is our example matching from, of the word example,

52:14.180 --> 52:18.860
inside a simple document, inside a simple text, and if we add up all

52:18.860 --> 52:24.420
this, you notice that we have 15 comparisons, string length is 24,

52:25.240 --> 52:28.920
Morris -Pratt in that case would require 25 comparisons.

52:30.540 --> 52:38.180
So we have saved 10 comparisons, and definitely sublinear number of

52:38.180 --> 52:40.240
comparisons, which definitely is an improvement.

52:41.420 --> 52:46.260
If you have, you can easily check if on your computers you have the

52:46.260 --> 52:52.640
right algorithm, if you would search for patterns of increasing size

52:52.640 --> 52:57.640
for a large pattern, which is aperiodic, you should get a very fast

52:57.640 --> 53:03.780
performance compared to a short pattern, which is periodic, because a

53:03.780 --> 53:10.880
large pattern would lead to a significant run time of such a pattern

53:10.880 --> 53:11.720
matching algorithm.

53:13.820 --> 53:20.180
Okay, that's an example, another example for Boyamur Abracadabra.

53:20.180 --> 53:26.740
I showed you that, for that I showed you in this tool, I showed you

53:26.740 --> 53:31.120
how the match heuristic looks like, that was this table, and we have

53:31.120 --> 53:36.520
the occurrence heuristic again here, you can look at the table, so

53:36.520 --> 53:43.300
here the delta actually is entered, so this is just showing for the

53:43.300 --> 53:52.000
letters in that word, that pattern, where they actually occur, so the

53:52.000 --> 53:58.880
distance of A from the rightmost end is zero, and obviously the

53:58.880 --> 54:02.600
others, here B is occurring in two positions from the rightmost end,

54:02.680 --> 54:03.220
and so on.

54:04.140 --> 54:10.740
This way you get this shifting table for the occurrence heuristic,

54:10.740 --> 54:18.340
where the maximum value actually is 11, which you get if you are at

54:18.340 --> 54:19.120
the rightmost.

54:19.180 --> 54:24.740
So this here, this entry, always is the case for, you have a pattern,

54:25.080 --> 54:33.080
or you have an element, or you have a mismatch, in the document you

54:33.080 --> 54:40.160
have a symbol which is not a pattern, and you can only shift as far as

54:40.160 --> 54:45.640
you currently are in your pattern, so if you are at the leftmost

54:45.640 --> 54:47.560
position, you can only shift by one position.

54:47.980 --> 54:56.340
If you are at the rightmost symbol, then you can shift for 11

54:56.340 --> 54:58.320
positions, all the way to the right end.

55:00.260 --> 55:04.060
Okay, and here since, like if you look at the match heuristic table,

55:04.060 --> 55:09.880
obviously this is quite periodic, like the A is occurring several

55:09.880 --> 55:18.240
times, the RA is occurring several times, but you see here, if you

55:18.240 --> 55:23.080
look at that suffix, RA, if you are at that position, it does reoccur,

55:23.300 --> 55:28.840
the RA, it occurs here, but it's again B at the right, B is the next

55:28.840 --> 55:35.080
symbol, so here you actually can move by 10 positions, because the A

55:35.080 --> 55:37.760
is reoccurring as a prefix.

55:38.820 --> 55:44.700
Both entries, if you are at this position, like the ABRA is occurring,

55:45.180 --> 55:52.780
the complete suffix here is reoccurring as a prefix, and the symbol,

55:53.300 --> 55:58.460
here the symbol to the left of that suffix was D, and here there's

55:58.460 --> 56:04.400
some, there's not the D, we don't know what symbol there is, exactly

56:04.400 --> 56:10.100
the case that we don't, like inside the pattern we don't have the

56:10.100 --> 56:14.680
situation that the suffix is reoccurring with a different symbol to

56:14.680 --> 56:20.120
the left, but the suffix is reoccurring as a prefix, actually the

56:20.120 --> 56:24.520
complete suffix, and so we have the maximum shift in distance.

56:28.700 --> 56:31.500
I'm showing you how these distances are computed.

56:32.340 --> 56:36.960
Okay, here is the example again for heuristic, you've seen that

56:36.960 --> 56:44.060
before, and this is the way we exploit information about the pattern.

56:44.780 --> 56:49.740
And if you compare those algorithms, Knuth-Forrest-Pratt, Knuth

56:49.740 --> 56:54.660
-Forrest -Pratt compares from left to right, Boyer-Moore from right to

56:54.660 --> 57:03.240
left, the, like in Knuth-Forrest-Pratt, we had a information from the

57:03.240 --> 57:08.000
function test, which would give us the position that has to be placed

57:08.000 --> 57:13.580
underneath the current position in the document, and for Boyer-Moore,

57:13.760 --> 57:18.140
the shifting distance, because this sigma of Jc defined the shift

57:18.140 --> 57:23.400
distance, and that position, so a different way of defining that.

57:24.520 --> 57:30.600
Next, here depends only on the pattern, in Knuth-Forrest-Pratt,

57:30.700 --> 57:37.500
whereas in Boyer-Moore, we also get information on the symbol from the

57:37.500 --> 57:39.160
document leading to a mismatch.

57:39.540 --> 57:43.920
That would give us some information on how far we can actually move.

57:45.360 --> 57:48.960
And then we have the worst case here, linear time, here we have

57:48.960 --> 57:55.280
product of n times, like document length, pattern length, average

57:55.280 --> 57:59.960
case, sublinear, and best case certainly also.

58:01.100 --> 58:08.900
This pattern, like Knuth-Forrest-Pratt has certainly also, although it

58:08.900 --> 58:15.060
is linear in its performance, it can have slightly different numbers

58:15.060 --> 58:20.800
of comparisons, so it is good for like patterns A, B, C, D, E.

58:21.580 --> 58:28.500
This next function doesn't give us a lot of information for its

58:28.500 --> 58:32.380
pattern, where you have only one symbol that is repeated.

58:33.440 --> 58:43.800
And if you have Boyer-Moore, it is good for like if you have a very

58:43.800 --> 58:52.020
periodic pattern, only A's in there, containing only very few A's, you

58:52.020 --> 58:53.620
can shift quite far.

58:54.420 --> 58:59.580
For almost all the symbols you can shift, for almost every position

58:59.580 --> 59:03.260
you would get a mismatch at the rightmost end, you lift the maximum

59:03.260 --> 59:03.820
distance.

59:05.200 --> 59:11.140
And it is bad for a pattern which is quite similar, where you have

59:11.140 --> 59:21.200
just at the first symbol a different one and there you would get quite

59:21.200 --> 59:25.020
a few comparisons, like more comparisons, more than linear number of

59:25.020 --> 59:25.480
comparisons.

59:25.920 --> 59:30.640
If you have a text containing only A's you would get a lift.

59:32.360 --> 59:38.900
This is just bad and good cases.

59:41.240 --> 59:45.960
So that is all I wanted to tell you about this approach of pattern

59:45.960 --> 59:46.480
matching.

59:46.520 --> 59:51.240
We have a query, and now we look into a document whether the query,

59:52.320 --> 59:56.060
whether the patterns occur inside the document or not.

59:56.060 --> 01:00:01.100
So here the approach is the one one of the two that I presented to you

01:00:01.100 --> 01:00:07.680
in the beginning pattern to search arbitrary documents as fast as

01:00:07.680 --> 01:00:08.080
possible.

01:00:08.680 --> 01:00:10.020
That's what we do here.

01:00:11.140 --> 01:00:15.640
Computing the shifting distance table with Morris Pratt, computing the

01:00:15.640 --> 01:00:21.860
next table we analyze the pattern and then consider all the documents.

01:00:21.860 --> 01:00:27.520
And now a different approach would be we analyze all the documents and

01:00:27.520 --> 01:00:30.260
then allow for very fast searches.

01:00:31.060 --> 01:00:32.980
This is a different approach.

01:00:33.100 --> 01:00:36.540
We do a pre-processing not on the patterns, but on the documents.

01:00:38.620 --> 01:00:42.060
And now it's a question how we can actually do that.

01:00:43.040 --> 01:00:50.540
How can we prepare a document for being analyzed for certain words.

01:00:51.700 --> 01:00:57.280
Normally, patterns that we look for usually are words.

01:00:57.480 --> 01:00:58.800
But the question is what is a word?

01:01:02.040 --> 01:01:05.320
This is essentially lexical analysis and compiling.

01:01:05.860 --> 01:01:10.380
You would like to find out which are the essential identifiers in your

01:01:10.380 --> 01:01:11.440
text, in your program.

01:01:11.960 --> 01:01:13.080
What are the separators?

01:01:13.880 --> 01:01:16.740
Or in Java you have a function called tokenizing.

01:01:17.560 --> 01:01:21.400
And this tokenizing function specifies a separator.

01:01:21.960 --> 01:01:31.940
And then you split your text into a sequence of substrings separated

01:01:31.940 --> 01:01:33.340
by those separators.

01:01:33.940 --> 01:01:38.060
The question is what are the appropriate separators to define words.

01:01:39.140 --> 01:01:43.620
So if you look at these texts we know, ok, presented is a word.

01:01:43.740 --> 01:01:44.680
Approach is a word.

01:01:44.980 --> 01:01:45.600
Toe is a word.

01:01:45.740 --> 01:01:46.420
Pattern is a word.

01:01:46.540 --> 01:01:47.360
Matching is a word.

01:01:47.800 --> 01:01:50.700
Or what is a word is a word.

01:01:50.980 --> 01:01:53.520
Four words inside the question.

01:01:54.180 --> 01:01:55.840
We immediately see what a word is.

01:01:56.040 --> 01:01:59.400
But now look at different examples.

01:02:01.380 --> 01:02:04.640
Is Jean-Claude, is that one word?

01:02:05.520 --> 01:02:07.680
Or is the hyphen a separator?

01:02:08.260 --> 01:02:09.320
Do we have two words there?

01:02:10.060 --> 01:02:13.060
Or is state of the art, is that one word?

01:02:14.940 --> 01:02:21.200
Or should this be considered as a combination of words, but not a

01:02:21.200 --> 01:02:21.860
single word?

01:02:22.340 --> 01:02:27.320
Or MS-DOS, nobody would look for MS-DOS, but somebody does.

01:02:28.380 --> 01:02:29.460
Is that one word?

01:02:29.540 --> 01:02:30.500
Or what happens here?

01:02:31.480 --> 01:02:38.020
You have a word which is split because it doesn't fit into this line,

01:02:38.720 --> 01:02:39.320
Arbeitsrecht.

01:02:40.640 --> 01:02:46.720
This is not really part, like this hyphen here is not really part of

01:02:46.720 --> 01:02:47.120
the word.

01:02:48.000 --> 01:02:51.900
It's just for breaking it across the two lines here.

01:02:52.420 --> 01:02:56.020
Or if you have even worse, if you have something like this, chapter 2

01:02:56.020 --> 01:02:58.940
.3, is that one word?

01:02:59.760 --> 01:03:06.580
Because we immediately see that this should be, chapter 2.3 should be

01:03:06.580 --> 01:03:12.800
considered as one word, because the 2.3 specifies the chapter.

01:03:14.500 --> 01:03:20.180
And you see that it is not simple to actually find out what words are.

01:03:20.260 --> 01:03:22.680
You need some heuristics to do that.

01:03:23.420 --> 01:03:27.660
Some rules how you decide what a word is.

01:03:28.620 --> 01:03:30.680
Then there are more problems.

01:03:31.060 --> 01:03:32.620
For example, BS2000.

01:03:34.440 --> 01:03:37.620
Who of you knows what BS2000 stands for?

01:03:39.640 --> 01:03:40.680
None of you knows.

01:03:41.160 --> 01:03:45.620
BS2000 is one of the most famous operating systems that has been

01:03:45.620 --> 01:03:47.920
designed by a German company, by Siemens.

01:03:48.420 --> 01:03:53.360
It was the common operating system of all the large mainframes

01:03:54.240 --> 01:04:01.440
produced by Siemens in the 80s, 90s of the last century.

01:04:02.240 --> 01:04:07.660
And BS2000 meant this is a very futuristic operating system, because

01:04:07.660 --> 01:04:12.780
in the 80s it said BS2000, this is good for the year 2000.

01:04:13.840 --> 01:04:15.020
So it's a long time ago.

01:04:15.140 --> 01:04:20.000
It was very similar to the IBM 360 operating system.

01:04:21.160 --> 01:04:21.920
Okay.

01:04:22.820 --> 01:04:23.520
F16.

01:04:23.520 --> 01:04:25.380
You know what an F16 is.

01:04:25.600 --> 01:04:27.900
A fighter in an airplane.

01:04:28.580 --> 01:04:29.260
H20.

01:04:29.500 --> 01:04:30.480
Chemical formulas.

01:04:32.120 --> 01:04:39.080
So you see what kind of problems we run into if we have to analyze a

01:04:39.080 --> 01:04:39.460
document.

01:04:40.660 --> 01:04:45.720
Another point is do we distinguish, do we say that words are different

01:04:45.720 --> 01:04:49.440
if they are different with respect to using small or capital letters?

01:04:49.440 --> 01:04:52.180
Or do we consider them to be the same?

01:04:52.980 --> 01:04:58.900
This is something actually which you can use, which we have as a

01:04:58.900 --> 01:05:06.080
parameter, whether a search engine should actually consider words to

01:05:06.080 --> 01:05:09.880
be different if you have small or capital letters.

01:05:10.120 --> 01:05:14.560
So you can have case-sensitive search or searches that are not case

01:05:14.560 --> 01:05:14.920
-sensitive.

01:05:16.460 --> 01:05:24.380
So sometimes this is very important and you have to consider these

01:05:24.380 --> 01:05:24.760
things.

01:05:25.740 --> 01:05:29.960
If we do that like if you look at lexical analysis, the things that we

01:05:29.960 --> 01:05:35.640
do here are quite similar or quite simple.

01:05:36.120 --> 01:05:38.720
You can do that using a finite state machine.

01:05:38.860 --> 01:05:42.360
I wrote automaton, but what I mean is a finite state machine.

01:05:42.820 --> 01:05:47.460
Lexical analysis actually is done using a finite state machine, finite

01:05:47.460 --> 01:05:51.420
automaton and there are standard tools for doing that.

01:05:51.800 --> 01:05:57.340
As I said, you can get a function tokenizing in Java which implements

01:05:57.340 --> 01:06:05.600
that if you just provide tokenizers and if you are performing a more

01:06:05.600 --> 01:06:13.540
sophisticated lexical analysis using more policies, more rules for

01:06:13.540 --> 01:06:18.220
actually defining what a word is and what a word should not be

01:06:18.220 --> 01:06:22.240
considered then you have a finite state machine which is a little bit

01:06:22.240 --> 01:06:22.940
more complex.

01:06:25.180 --> 01:06:29.900
And then you would like to filter relevant words.

01:06:31.500 --> 01:06:33.100
Now, what are relevant words?

01:06:34.700 --> 01:06:40.100
You cannot certainly if you consider or if you analyze a document you

01:06:40.100 --> 01:06:45.340
cannot say, ok, I know what words will be searched for because I know

01:06:45.340 --> 01:06:45.800
some queries.

01:06:45.880 --> 01:06:46.860
You don't know any queries.

01:06:47.860 --> 01:06:53.000
So you have to estimate which words should be considered to be looked

01:06:53.000 --> 01:06:53.420
for.

01:06:54.420 --> 01:06:56.900
So nobody would look for the word on.

01:06:57.840 --> 01:07:06.440
Nobody would look for the word off or eh or for all probably also not.

01:07:07.640 --> 01:07:12.500
So there are certain words where you can say these are words on some

01:07:12.500 --> 01:07:17.060
kind of stop list or negative list which you don't put into the

01:07:17.060 --> 01:07:20.280
collection of words that are occurring inside the document.

01:07:22.180 --> 01:07:27.080
So this stop list may be integrated into lexical analysis to just

01:07:27.080 --> 01:07:29.400
filter out the irrelevant words.

01:07:30.900 --> 01:07:37.240
This is not of real practical value but I said lexical analysis can be

01:07:37.240 --> 01:07:40.360
done by a finite automaton machine.

01:07:41.260 --> 01:07:45.820
So if you would like to build an automaton which should be able to

01:07:45.820 --> 01:07:52.140
filter out those words or identify certain words which are stop words

01:07:52.140 --> 01:07:55.400
then this automaton would do it.

01:07:55.400 --> 01:08:03.120
This is the words which could be put in as an input to that automaton.

01:08:04.200 --> 01:08:08.980
And if you have detected an A as the first character then you know

01:08:08.980 --> 01:08:19.120
it's only those two words which can occur here which like after the A

01:08:19.120 --> 01:08:26.300
and since the A itself is one of those stop words this is indicating

01:08:26.300 --> 01:08:33.120
the final state like this double ring final state and so because the

01:08:33.120 --> 01:08:35.700
lambda is in there it tells you have detected a stop word.

01:08:36.340 --> 01:08:40.380
Here if you continue and you have an N as next symbol again you have a

01:08:40.380 --> 01:08:45.140
lambda and here you have this final state and you have detected the

01:08:45.140 --> 01:08:48.780
AN, the N and there you would have detected the end.

01:08:48.780 --> 01:08:53.700
And the same for the other ones here if you have an A you have the

01:08:53.700 --> 01:08:55.960
possible extensions RN and NTO.

01:08:56.940 --> 01:09:02.080
You have not reached some end of a word.

01:09:02.600 --> 01:09:07.320
Here you have reached the end of a word again so you have another

01:09:07.320 --> 01:09:08.020
final state.

01:09:08.420 --> 01:09:12.500
So this is just giving you some idea how those automata could look

01:09:12.500 --> 01:09:17.460
like but we don't go into the details of filtering this, showing that

01:09:17.460 --> 01:09:24.620
this is some more or less standard way how you would deal with those

01:09:24.620 --> 01:09:25.180
texts.

01:09:25.660 --> 01:09:31.360
If you have a sequence of or a collection of words which have to be

01:09:31.360 --> 01:09:34.540
filtered out you can do that this way.

01:09:36.260 --> 01:09:42.100
So you can easily construct that and then you are able to detect the

01:09:42.100 --> 01:09:44.180
occurrence of words which are stop words.

01:09:45.620 --> 01:09:52.340
Whenever you are in such a situation if you are in a final state you

01:09:52.340 --> 01:09:53.460
know you can delete something.

01:09:54.780 --> 01:09:59.120
This was filtering out the irrelevant words which would look like

01:09:59.120 --> 01:09:59.160
this.

01:09:59.680 --> 01:10:08.920
And then we would like to make our tables that we have or the list of

01:10:08.920 --> 01:10:16.600
words that occur inside a document as short as possible or we could we

01:10:16.600 --> 01:10:22.640
would also like to be able to look for similar words for words that

01:10:22.640 --> 01:10:27.600
are reasonable to look for if you look for one particular word.

01:10:28.180 --> 01:10:30.780
So what we do here is what is called stemming.

01:10:31.280 --> 01:10:43.840
It means that if you search for the word working for example or for as

01:10:43.840 --> 01:10:50.880
I say here for ponies if this is your search term it would be nice to

01:10:50.880 --> 01:10:55.640
get documents which not only match the word working but also the word

01:10:55.640 --> 01:11:01.650
work which would be the stem for that word.

01:11:02.600 --> 01:11:05.900
Or if you look for ponies you should also look for pony.

01:11:07.880 --> 01:11:12.700
And so if you look for pony you get more matches if you look for the

01:11:12.700 --> 01:11:12.940
stem.

01:11:14.360 --> 01:11:18.680
And so this is generalizing your search to all the words having the

01:11:18.680 --> 01:11:19.220
same stem.

01:11:20.040 --> 01:11:25.360
Or if you look at the list of words that you would use if you analyze

01:11:25.360 --> 01:11:29.760
a document, if you just record all the word stems, you have a short

01:11:29.760 --> 01:11:30.240
list.

01:11:31.040 --> 01:11:34.500
Because you don't have to list all the variants of the word.

01:11:36.020 --> 01:11:43.940
Now we can do that by just looking at by just replacing or reducing or

01:11:43.940 --> 01:11:47.340
replacing or deleting suffixes.

01:11:48.640 --> 01:11:55.100
If you have an S in the end, that's plural from cats we get to cat.

01:11:55.800 --> 01:11:59.600
If we have an ED like plaster, it goes to plaster.

01:12:00.240 --> 01:12:06.720
And then there are all kinds of suffixes of words which can be

01:12:06.720 --> 01:12:09.600
released or shortened significantly.

01:12:11.940 --> 01:12:15.200
And you see this sometimes is an iterative process.

01:12:16.460 --> 01:12:23.940
So it may be that you have deleted the ED and then you have here a BL

01:12:23.940 --> 01:12:27.440
in the end, then you have to add an E again.

01:12:28.280 --> 01:12:29.680
This is an iterative process.

01:12:31.700 --> 01:12:37.100
And in this way you get all kinds of stems.

01:12:41.160 --> 01:12:49.900
So you see here for example, conformably which can occur as a result

01:12:49.900 --> 01:12:55.160
of taking out something or reducing something, then you have to

01:12:55.160 --> 01:12:58.120
replace that because this is the real stem.

01:12:58.900 --> 01:13:05.100
In this way you get this proper list of stems which is allowing you to

01:13:05.100 --> 01:13:06.320
generalize your search.

01:13:08.200 --> 01:13:08.480
Okay.

01:13:09.240 --> 01:13:10.900
And certainly this is language dependent.

01:13:12.280 --> 01:13:17.180
You have to consider the rule of getting to a stem in your different

01:13:17.180 --> 01:13:18.740
languages.

01:13:18.740 --> 01:13:26.760
Depending on your grammar or syntax of your language.

01:13:27.800 --> 01:13:29.760
Advantages, you have fewer words.

01:13:30.580 --> 01:13:34.940
You can reduce search structures by around 50%.

01:13:34.940 --> 01:13:40.680
And you have an automatic extension of the search to related words

01:13:40.680 --> 01:13:41.640
having the same stem.

01:13:43.460 --> 01:13:48.680
Some search engines actually offer you the possibility to look for

01:13:49.420 --> 01:13:51.000
word stems only.

01:13:51.640 --> 01:13:54.840
That means you have to have a broadened search.

01:13:56.660 --> 01:13:57.160
Okay.

01:13:57.360 --> 01:13:59.380
Then you have to build an efficient search structure.

01:14:00.400 --> 01:14:02.280
What is an efficient search structure?

01:14:02.500 --> 01:14:03.600
Kind of an index.

01:14:04.280 --> 01:14:05.640
Here, this is our document.

01:14:06.340 --> 01:14:10.780
Now we would like to get an index where we have a listing of all the

01:14:10.780 --> 01:14:11.100
words.

01:14:11.940 --> 01:14:19.380
And for every word we need some indication where that actually occurs

01:14:19.380 --> 01:14:24.200
inside this document or maybe also in another document.

01:14:25.360 --> 01:14:28.800
So what you actually would like to have is for every word that could

01:14:28.800 --> 01:14:37.060
occur in a query you would like to have a list of entries that tell

01:14:37.060 --> 01:14:42.100
you where does this word occur and in which documents at which

01:14:42.100 --> 01:14:42.700
positions.

01:14:44.000 --> 01:14:45.720
And maybe also how often does it occur.

01:14:46.200 --> 01:14:50.060
So where it occurs, document and location within the document.

01:14:51.040 --> 01:14:59.900
Then the relevance there may be some way of determining the relevance

01:14:59.900 --> 01:15:05.080
of that particular term.

01:15:07.180 --> 01:15:09.240
And how often it occurs.

01:15:10.380 --> 01:15:14.300
If a term occurs very often we would say it's more relevant for that

01:15:14.300 --> 01:15:14.660
document.

01:15:15.840 --> 01:15:21.020
So that's what I indicated at the beginning of this chapter some

01:15:21.020 --> 01:15:26.820
standard ways of looking at the relevance of a word or the relevance

01:15:26.820 --> 01:15:34.260
of a document for a certain query if a word occurs very close to the

01:15:34.260 --> 01:15:39.440
start of the document, in the heading, in the abstract or very close

01:15:39.440 --> 01:15:46.680
to the start of the document it will get a higher weight than a term

01:15:46.680 --> 01:15:48.920
that occurs only at the end of the document.

01:15:50.200 --> 01:15:50.760
Okay.

01:15:51.660 --> 01:15:54.360
And then the question is what kind of data structures do we need for

01:15:54.360 --> 01:15:54.680
that?

01:15:54.680 --> 01:16:00.620
And you could have just a sorted list with random access then you

01:16:00.620 --> 01:16:07.920
could always like this inverted file you look for a certain word, if

01:16:07.920 --> 01:16:12.780
you have a sorted list you can easily get the appropriate position and

01:16:12.780 --> 01:16:19.540
then it takes lock-in time maximum to get to the appropriate position

01:16:19.540 --> 01:16:25.220
and then you find the documents where you have to those are the

01:16:25.220 --> 01:16:31.080
documents that actually and then you know that you have to retrieve

01:16:31.080 --> 01:16:36.640
those documents as or at least retrieve a description of those

01:16:36.640 --> 01:16:42.740
documents display that to the user and that user can actually retrieve

01:16:42.740 --> 01:16:47.680
the document and you should also perform a ranking of the documents

01:16:47.680 --> 01:16:47.700
The word space is the word space of the query.

01:16:47.700 --> 01:16:47.740
This has a lot of words.

01:16:51.140 --> 01:16:53.660
So you can see a list of these documents for the query that you have

01:16:53.660 --> 01:16:54.360
phrased.

01:16:55.220 --> 01:16:59.340
So a sorted list is a simple structure, but you have logarithmic

01:16:59.340 --> 01:17:21.800
access time, the Then you need to reorder your sorted list.

01:17:22.700 --> 01:17:25.540
Linear time for insertions, deletions is not nice.

01:17:25.900 --> 01:17:28.020
So that is not really that perfect.

01:17:28.320 --> 01:17:29.760
Tree structures are much better.

01:17:30.860 --> 01:17:34.980
And there are some tree structures that are quite well known.

01:17:34.980 --> 01:17:42.880
B-trees are a way of storing information about large sets of data.

01:17:45.280 --> 01:17:52.260
So B stands for Bayer, which is a researcher at Munich.

01:17:52.860 --> 01:17:54.400
I used to be a researcher at Munich.

01:17:54.560 --> 01:17:57.840
Balanced search trees, one page per note.

01:17:58.640 --> 01:18:00.160
Do you know B-trees already?

01:18:00.760 --> 01:18:02.140
Who of you knows B-trees?

01:18:02.900 --> 01:18:03.880
None of you.

01:18:04.180 --> 01:18:07.560
So I will briefly give you an example of what a B-tree looks like, but

01:18:07.560 --> 01:18:10.480
only showing you the major ideas behind that.

01:18:12.860 --> 01:18:16.180
It's useful for large search structures on a hard disk.

01:18:17.680 --> 01:18:20.140
And then there are so-called...

01:18:20.140 --> 01:18:25.040
if you look at how this is spelled, you would try to explain it.

01:18:27.060 --> 01:18:34.140
But this is actually pronounced as trees, because it's coming from the

01:18:34.140 --> 01:18:38.520
word retrieval.

01:18:39.180 --> 01:18:43.780
And this is a special search structure, a tree-oriented search

01:18:43.780 --> 01:18:48.220
structure, for retrieval of information in documents.

01:18:49.340 --> 01:18:53.980
And I will give you an example of B-trees.

01:18:54.480 --> 01:18:56.100
Just a quick example.

01:18:56.100 --> 01:19:01.220
And I will show you how we can actually build those trees from

01:19:01.220 --> 01:19:02.080
retrieval.

01:19:02.920 --> 01:19:05.640
And in particular I will present to you Patricia trees.

01:19:08.140 --> 01:19:16.060
There we have a separation of search words into symbols.

01:19:16.180 --> 01:19:20.100
And we have a nice way of actually finding out all the words that

01:19:20.100 --> 01:19:23.180
occur in a document in a tree-structured way.

01:19:24.180 --> 01:19:26.360
Let me start with B-trees.

01:19:26.780 --> 01:19:32.440
As I said, a search structure supporting an efficient search for

01:19:32.440 --> 01:19:34.240
keywords from an ordered set.

01:19:35.940 --> 01:19:45.220
So in this B-tree we have these five assignments here, five rules.

01:19:46.720 --> 01:19:49.060
All the leaves have the same depth.

01:19:49.180 --> 01:19:50.460
I will give you an example here.

01:19:51.580 --> 01:19:57.580
Immediately, so for M equals 3, which is actually much smaller than

01:19:57.580 --> 01:20:03.720
the real B-trees are, usually the M will be, as I said, page size of

01:20:03.720 --> 01:20:11.220
the memory system, so at 1K or 2K or 4K bits.

01:20:11.920 --> 01:20:15.400
So this will be large values.

01:20:16.120 --> 01:20:20.040
All the leaves have the same depth, as you can see here in this

01:20:20.040 --> 01:20:20.400
example.

01:20:25.220 --> 01:20:31.180
Every inner node, except for the root, has at least M over 2

01:20:31.180 --> 01:20:31.560
successors.

01:20:32.160 --> 01:20:39.200
So it means that there are at least M over 2 successors.

01:20:39.200 --> 01:20:41.180
In this case, M is 3.

01:20:41.920 --> 01:20:45.900
M over 2, next larger integer, is 2.

01:20:46.860 --> 01:20:51.480
And so every node here has at least two successors.

01:20:52.840 --> 01:20:55.880
And at most, M successors.

01:20:58.780 --> 01:21:02.100
And here you have three successors.

01:21:02.300 --> 01:21:04.660
Here the root also has three successors.

01:21:06.560 --> 01:21:11.040
And every node with K successors contains K minus one keys.

01:21:11.360 --> 01:21:13.100
So what if we look for certain keys?

01:21:14.640 --> 01:21:20.420
For example, we could just enter, let's enter some numbers in here.

01:21:21.440 --> 01:21:25.360
So, what shall we use here?

01:21:25.820 --> 01:21:32.980
If I say here 7, no, I should just enter 2.

01:21:32.980 --> 01:21:43.340
So if we have 7 there, and let's say 23 there, then all the keys that

01:21:43.340 --> 01:21:52.300
are less than 7 will appear at the subtree that is to the left of 7.

01:21:52.940 --> 01:21:59.000
So we could say here, let's say 2 and 5, for example.

01:22:00.460 --> 01:22:06.240
Those could be entries like keys inside that B tree.

01:22:07.380 --> 01:22:12.520
And here, for example, we could have a value like 15, which is larger

01:22:12.520 --> 01:22:14.940
than 7, smaller than 23.

01:22:15.620 --> 01:22:26.100
And here, for example, we could have any value 25, 37.

01:22:27.030 --> 01:22:34.880
These would be two keys where, again, 25 is larger than 23.

01:22:35.260 --> 01:22:39.580
15 is less than 23, larger than 7, so it's in between here.

01:22:40.120 --> 01:22:44.040
That's the way we order the elements inside a node.

01:22:44.840 --> 01:22:51.720
And the branches here are determined such that you always look, or

01:22:51.720 --> 01:22:58.160
like if you go through that, you look for a certain key, and then you

01:22:58.160 --> 01:23:02.620
check at which branch you actually have to, or which branch you have

01:23:02.620 --> 01:23:03.080
to follow.

01:23:03.580 --> 01:23:07.660
It's always like if you follow this branch here, the key that you look

01:23:07.660 --> 01:23:12.480
for must be in between, like must be larger than 7, less than 23.

01:23:13.400 --> 01:23:15.220
That's an example of a structure.

01:23:16.040 --> 01:23:21.000
And in this way that this is constructed, you have a search structure

01:23:21.000 --> 01:23:27.900
where the paths from the root to the leaves all have the same length.

01:23:28.380 --> 01:23:37.700
The size of the nodes may be different in between half m or m-1

01:23:37.700 --> 01:23:38.280
elements.

01:23:38.280 --> 01:23:49.180
And if m is the size, it means that you can retrieve a complete node

01:23:49.180 --> 01:23:55.840
with one access to secondary memory to retrieve a complete page.

01:23:56.760 --> 01:24:01.760
And that's, one page is exactly one node.

01:24:01.760 --> 01:24:04.440
It always, such a node fits into a page.

01:24:05.320 --> 01:24:11.480
And so you can then analyze that page and continue your search.

01:24:11.480 --> 01:24:21.260
So this is a search structure which is oriented, or which is

01:24:21.260 --> 01:24:30.500
considering the page-oriented organization of secondary memory.

01:24:31.660 --> 01:24:38.420
And so this way you can have, this is a nice search structure for

01:24:38.420 --> 01:24:40.160
large databases, for example.

01:24:41.720 --> 01:24:45.640
Now the best thing certainly is if you don't have to actually go to

01:24:45.640 --> 01:24:47.900
secondary memory, because that's very time consuming.

01:24:48.660 --> 01:24:50.100
Back to that a little bit later.

01:24:51.260 --> 01:24:56.940
The operations on such a data structure are quite simple.

01:24:56.940 --> 01:25:03.520
So search, insert, delete can be easily done in log m of n.

01:25:03.960 --> 01:25:15.440
m in this case was, as I said, the degree of our tree.

01:25:15.440 --> 01:25:25.060
And so the path length in this case is always the same, is logarithmic

01:25:25.060 --> 01:25:30.540
in n to the base m.

01:25:31.340 --> 01:25:33.480
Some constant maybe multiplied to that.

01:25:33.860 --> 01:25:37.280
And n is the number of keys that you have in that search structure.

01:25:37.280 --> 01:25:43.120
So the search structure will always have this logarithmic behavior for

01:25:43.120 --> 01:25:48.940
searching, inserting something, deleting something.

01:25:51.060 --> 01:25:52.980
Okay, that's the B-tree.

01:25:53.460 --> 01:25:58.060
And in practice, as I said, m is about half the memory page size.

01:25:58.640 --> 01:26:02.160
And all the keys of a node fit into one page of memory.

01:26:02.300 --> 01:26:03.820
That's the major point.

01:26:04.740 --> 01:26:12.120
And then the next topic will be these trees for information retrieval

01:26:12.120 --> 01:26:14.140
of words by character-based comparisons.

01:26:15.280 --> 01:26:20.700
And one example that I can briefly show you is, assume you would like

01:26:20.700 --> 01:26:23.520
to search for those words.

01:26:27.360 --> 01:26:31.660
By looking at the first character, it's either an s or a w.

01:26:32.240 --> 01:26:35.780
If it is an s, you know exactly immediately it is this word.

01:26:36.280 --> 01:26:39.320
If it is a w, you have to branch further.

01:26:40.700 --> 01:26:48.080
If the next symbol is an o or an e, then if it is an o, you know this

01:26:48.080 --> 01:26:48.820
is the word wo.

01:26:48.900 --> 01:26:52.800
If it is an i, you know this is the word wir.

01:26:52.800 --> 01:26:55.980
If it is an e, you have to distinguish i and r.

01:26:58.360 --> 01:27:08.160
The tree structure to specify or to provide a way of searching for a

01:27:08.160 --> 01:27:09.700
word from that set.

01:27:11.120 --> 01:27:16.120
And so we can look at documents, word sets for documents, or

01:27:16.120 --> 01:27:21.080
indications how a document is structured and build a tree similar to

01:27:21.080 --> 01:27:21.340
that.

01:27:21.340 --> 01:27:22.680
We'll do that next time.

01:27:23.020 --> 01:27:29.520
Next time, as I've written there also, again, week 8 and at 9.30.

01:27:30.100 --> 01:27:33.340
Then on the 15th of December, there will be no lecture.

01:27:33.860 --> 01:27:36.080
So let me just write that again here.

01:27:37.660 --> 01:27:43.460
8th, we have lecture at 8, plus 9.30.

01:27:44.460 --> 01:27:47.400
On the 15th, there's no lecture.

01:27:47.960 --> 01:27:51.280
And on the 22nd, there's one.

01:27:53.320 --> 01:27:55.220
Thank you for your attention.

01:27:55.460 --> 01:27:56.260
That's it for today.

