WEBVTT

00:09.270 --> 00:09.970
Okay.

00:10.750 --> 00:17.370
Welcome, also to those who just entered the room, to this lecture on

00:17.370 --> 00:18.990
algorithms for internet applications.

00:19.730 --> 00:26.390
I'm here again in class after one week during which I was away.

00:27.070 --> 00:31.950
I've been in Canada for an interesting series of talks, on special

00:31.950 --> 00:35.870
invitations there, and so I couldn't be here.

00:37.470 --> 00:41.370
But I hope it was okay to give you the recorded lectures of previous

00:41.370 --> 00:45.310
years because the material had been recorded there.

00:45.670 --> 00:50.250
But still I think it's good to have a lecturer in class, because I'd

00:50.250 --> 00:52.890
like to see you, I'd like to see who is in my class.

00:53.390 --> 00:58.730
And so I would just like to briefly go over the slides that we had

00:58.730 --> 01:05.250
before, though not in full detail again.

01:05.930 --> 01:06.750
What is this?

01:07.590 --> 01:09.570
I don't want to change anything.

01:13.750 --> 01:14.330
What?

01:18.520 --> 01:20.960
I don't know what happens if I would do that.

01:21.420 --> 01:22.020
I have no idea.

01:22.780 --> 01:24.020
I just cancel this.

01:25.500 --> 01:30.600
Okay, so we had looked at the Levenshtein distance, that was the last

01:30.600 --> 01:32.780
slide I showed you two weeks ago.

01:33.100 --> 01:35.940
I don't want to go over all of these things again, but remember that

01:35.940 --> 01:39.400
we had looked at the editing distance and how we can determine the

01:39.400 --> 01:43.620
distance of two words, like doing this in this dynamic programming

01:43.620 --> 01:49.360
approach, looking at how we can actually transform a short sequence

01:49.360 --> 01:52.540
into another short sequence, and then use the information about short

01:52.540 --> 01:56.620
sequences or prefixes of those two strings in order to derive

01:56.620 --> 02:00.260
information on extended prefixes.
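The dynamic-programming idea recapped here can be sketched in a few lines of Python; the function and variable names are mine, not taken from the slides:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming: d[i][j] is the distance
    between the first i symbols of a and the first j symbols of b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all i symbols
    for j in range(n + 1):
        d[0][j] = j          # insert all j symbols
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```

The table is filled exactly as described: distances for short prefixes are computed first and then extended.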

02:01.160 --> 02:11.660
Now, we then came to the next topic: how we can actually search for

02:11.660 --> 02:12.260
information.

02:12.960 --> 02:19.840
And the simple search is just designing in a naive way some kind of

02:19.840 --> 02:21.800
finite automaton to search for a string.

02:22.300 --> 02:26.260
The next thing is to look at a slightly more intelligent approach,

02:26.440 --> 02:27.760
which had been designed

02:38.120 --> 02:45.000
in 1977, where the information that has already been scanned on the

02:45.000 --> 02:50.080
pattern and the string actually is used in order to compute reasonable

02:50.080 --> 02:56.860
shift distances, or rather, for Knuth-Morris-Pratt, to find out

02:56.860 --> 03:01.760
what the next symbol of the pattern is that should be compared with

03:01.760 --> 03:06.280
the current symbol in the text after a mismatch has occurred.

03:06.440 --> 03:11.280
That was the purpose of the next function, which

03:11.280 --> 03:16.020
gives you the position of the next symbol of the pattern that should

03:16.020 --> 03:20.240
be positioned under the current symbol in the text.

03:20.440 --> 03:25.860
And in this way, the position of the text never has to be reduced, but

03:25.860 --> 03:31.440
it will always be a monotonically increasing sequence of positions.

03:32.060 --> 03:37.500
And in that way, we get a linear time for pattern searching or pattern

03:37.500 --> 03:39.480
matching in a document.

03:39.680 --> 03:41.820
This was the main point there.

03:42.380 --> 03:47.040
And the algorithm itself is rather short, but very

03:47.040 --> 03:54.240
effective, leading to a runtime linear in the size of the

03:54.240 --> 03:55.640
document that has to be searched.
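The next function and the monotonically advancing text position can be sketched compactly; this is my own Python version of the idea, not the slide's code:

```python
def next_table(pattern: str) -> list[int]:
    """KMP 'next' function: for each prefix length j, the pattern
    position to continue with after a mismatch (-1 = advance the text)."""
    nxt = [-1] + [0] * len(pattern)
    k = -1
    for j in range(len(pattern)):
        while k >= 0 and pattern[j] != pattern[k]:
            k = nxt[k]
        k += 1
        nxt[j + 1] = k
    return nxt

def kmp_search(text: str, pattern: str) -> int:
    """First occurrence of pattern in text, or -1.  The text position i
    only ever increases, so the search is linear in len(text)."""
    nxt = next_table(pattern)
    j = 0
    for i, c in enumerate(text):
        while j >= 0 and c != pattern[j]:
            j = nxt[j]   # fall back within the pattern, never in the text
        j += 1
        if j == len(pattern):
            return i - len(pattern) + 1
    return -1
```

For example, kmp_search("banana", "nan") returns 2, and the text index never moves backwards during the search.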

03:55.820 --> 03:57.500
There were some examples given.

03:58.100 --> 04:03.620
I will not go through that example again; it was covered in

04:03.620 --> 04:04.460
the recorded lecture.

04:05.560 --> 04:09.060
And then the improvement was the next thing, the improvement by Boyer

04:09.060 --> 04:09.460
and Moore.

04:09.860 --> 04:13.700
And their main, interesting idea is not to

04:13.700 --> 04:17.680
look or to compare a pattern starting at the leftmost symbol, but at

04:17.680 --> 04:19.100
the rightmost symbol of the pattern.

04:19.940 --> 04:26.320
And in that way, as we have seen, it is possible to get an expected

04:26.320 --> 04:33.540
runtime that is sublinear, that is length of the text divided by the

04:33.540 --> 04:34.340
length of the pattern.

04:34.980 --> 04:40.860
But in the worst case, you get a time that is the product of those two

04:40.860 --> 04:41.500
lengths.

04:41.620 --> 04:46.280
So product of the length of the document times the length of the

04:46.280 --> 04:46.520
pattern.

04:46.760 --> 04:51.060
And so that is something which is not good.

04:51.320 --> 04:59.580
But since most documents are not worst case documents, the runtime of

04:59.580 --> 05:02.940
Boyer-Moore is actually very good and better than the Knuth-Morris

05:02.940 --> 05:03.580
-Pratt algorithm.

05:04.340 --> 05:07.560
So there were two heuristics. The occurrence heuristic looks at

05:07.560 --> 05:11.980
the occurrence of a symbol of the document within the

05:11.980 --> 05:12.360
pattern.

05:12.620 --> 05:17.840
So a symbol that is just looked at in the document is looked for in

05:17.840 --> 05:18.320
the pattern.

05:18.540 --> 05:21.500
And if it does not occur in the pattern, you definitely can shift the

05:21.500 --> 05:25.140
pattern all the way past that symbol.

05:25.920 --> 05:29.360
Remember that we, in that case, we are starting at the rightmost end

05:29.360 --> 05:29.940
of the pattern.

05:30.520 --> 05:33.400
And as soon as we have a mismatch, we can look whether this symbol

05:33.400 --> 05:35.200
occurs in the remaining part.

05:35.600 --> 05:38.500
And if not, the pattern can be shifted quite a far distance.

05:39.080 --> 05:42.380
Or if it occurs at some position, it will be just shifted by that

05:42.380 --> 05:46.460
distance until we have the right symbol there.

05:46.680 --> 05:51.100
And certainly we have to look whether the remaining part that you have

05:51.100 --> 05:59.020
already scanned is also reoccurring to the right of that symbol

05:59.200 --> 05:59.380
there.

05:59.840 --> 06:03.540
And this certainly is quite a requirement that this suffix of the

06:03.540 --> 06:08.640
pattern is reoccurring at some other position to the left of that.

06:09.460 --> 06:17.060
And so the shift distances here are rather large with respect to the

06:17.060 --> 06:18.000
occurrence heuristic.

06:18.140 --> 06:23.560
And we looked also at a match heuristic, where we actually look more

06:23.560 --> 06:27.300
at the suffix that has occurred here.

06:27.800 --> 06:29.640
So I have to correct that.

06:30.060 --> 06:36.420
The thing was that here we would, with the occurrence heuristic, we

06:36.420 --> 06:42.500
only look at the reoccurrence of that symbol and shift it again and

06:42.500 --> 06:46.740
then start comparing again at the rightmost position.

06:47.660 --> 06:54.840
And only in the next heuristic, with the match heuristic, we look

06:54.840 --> 06:59.060
whether the pattern or the suffix, which has already been scanned, if

06:59.060 --> 07:05.560
that reappears within the pattern and appears at a position where the

07:05.560 --> 07:09.260
symbol to the left of that suffix that had already been scanned

07:09.260 --> 07:13.040
successfully, is actually different.

07:13.460 --> 07:16.020
Because if it would be the same, we would have a mismatch again.

07:16.460 --> 07:18.880
This can be done by just looking at the pattern.

07:19.760 --> 07:24.280
And so here, with the match heuristic, we look whether the suffix

07:24.280 --> 07:26.080
reoccurs. In the occurrence heuristic,

07:26.240 --> 07:30.140
we only look whether a certain symbol of the alphabet that might be in

07:30.140 --> 07:32.520
the text occurs in the pattern.

07:33.180 --> 07:37.720
And so we can compute a shift distance just with respect to those

07:37.720 --> 07:40.160
symbol occurrences.

07:41.120 --> 07:46.780
And the example here shows that the shift distances for this example,

07:46.780 --> 07:51.140
banana, which is a word where certain shift distances or certain

07:51.140 --> 07:55.020
suffixes reoccur, are quite large.

07:55.420 --> 07:57.440
Why are they that large?

07:57.680 --> 08:00.440
Or how do we get to these distances?

08:01.000 --> 08:03.540
If you look at this A here, it is a suffix.

08:04.520 --> 08:09.820
If you have a mismatch, you would look whether this suffix A reoccurs

08:09.820 --> 08:11.500
with a different symbol to its left.

08:12.300 --> 08:14.580
There the A does reoccur, but with an N to its left.

08:14.720 --> 08:16.940
And there the A reoccurs with a B.

08:17.140 --> 08:20.340
That's the occurrence where you have a different symbol to the left.

08:20.520 --> 08:25.360
So the shifting would move that pattern to this position, such that

08:25.360 --> 08:30.300
you have now A here, as before, and you have N and B, different

08:30.300 --> 08:30.740
symbols.

08:31.260 --> 08:35.840
And you would re-compare, certainly starting at this rightmost symbol,

08:35.960 --> 08:45.060
but you know that at least here there might be some match

08:45.060 --> 08:50.900
if we actually get to that position in the repeated comparison that

08:50.900 --> 08:53.680
starts from the rightmost position.

08:54.380 --> 08:55.820
Okay, and here are the other things.

08:56.040 --> 09:00.420
So, for example, NA, that suffix, does not reoccur with a different

09:00.420 --> 09:02.940
symbol to its left.

09:03.480 --> 09:07.500
So in this case, we would just move the pattern all the way to the

09:07.500 --> 09:14.000
right, because there cannot be any match within that pattern anymore.
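A runnable simplification of Boyer-Moore that uses only the occurrence heuristic (this is essentially Horspool's variant; the full algorithm discussed in the lecture additionally takes the larger of this shift and the match-heuristic shift):

```python
def horspool_search(text: str, pattern: str) -> int:
    """Compare the pattern right to left; on a mismatch, shift by the
    occurrence heuristic of the text symbol under the pattern's last
    position.  Returns the first match index, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Shift table: distance from a symbol's rightmost occurrence
    # (excluding the last pattern position) to the pattern's end;
    # symbols not in the pattern allow a full shift of m.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos          # full right-to-left match
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

On texts containing few of the pattern's symbols, most shifts are close to the pattern length, which is exactly the source of the expected sublinear behavior mentioned above.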

09:14.620 --> 09:16.520
And here are the other distances.

09:16.700 --> 09:23.460
I don't want to go through all those slides again.

09:23.820 --> 09:31.860
There was the preprocessing time, and then an extension of the

09:31.860 --> 09:32.440
algorithm.

09:32.700 --> 09:36.240
If you have K patterns that you actually look at, like in a query, you

09:36.240 --> 09:37.640
would look for several terms.

09:38.060 --> 09:42.280
And if you try to look whether several patterns occur in a document,

09:42.720 --> 09:51.520
you can do that by just combining this idea and search for a tree of

09:51.520 --> 09:52.500
words, essentially.
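The "tree of words" idea for K patterns can be sketched as a trie scan. Note this naive version restarts from every text position; the Aho-Corasick automaton adds KMP-style failure links to make it linear, which I do not show here:

```python
def build_trie(words):
    """Tree of words: each node maps a symbol to a child node; the "$"
    entry marks that one of the query words ends at this node."""
    root = {}
    for w in words:
        node = root
        for c in w:
            node = node.setdefault(c, {})
        node["$"] = w
    return root

def find_words(text, words):
    """Walk the trie from every text position and collect all query
    words that occur somewhere in the text."""
    trie = build_trie(words)
    found = set()
    for i in range(len(text)):
        node = trie
        for c in text[i:]:
            if c not in node:
                break
            node = node[c]
            if "$" in node:
                found.add(node["$"])
    return found
```

Searching "banana algorithms" for the words "ban", "nan" and "algo" finds all three in a single left-to-right scan of the trie per position.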

09:54.200 --> 09:59.320
Here was another example for Boyer-Moore, which was in the recorded

09:59.320 --> 09:59.740
lecture.

09:59.900 --> 10:00.860
Here's another example.

10:01.580 --> 10:06.460
And this is the last slide that was in the recorded lecture.

10:06.580 --> 10:08.540
So let me start from that slide again.

10:08.640 --> 10:12.620
This comparison of Knuth-Morris-Pratt and Boyer-Moore.

10:14.300 --> 10:19.740
Yeah, this is... I have to set the color to the right color.

10:20.480 --> 10:21.420
Oops, there it is.

10:21.520 --> 10:22.360
I would like that one.

10:22.840 --> 10:24.340
So here we have Knuth-Morris-Pratt.

10:24.800 --> 10:26.680
And there we have Boyer-Moore.

10:27.140 --> 10:31.140
And we would like to look how they compare to each other.

10:31.640 --> 10:36.140
So Knuth-Morris-Pratt compares patterns from left to right, Boyer

10:36.140 --> 10:37.060
-Moore from right to left.

10:37.200 --> 10:40.500
Although, if you look at extended versions of those pattern matching

10:40.500 --> 10:45.880
algorithms, actually the ideas of Boyer-Moore have been also extended

10:45.880 --> 10:51.320
and actually transferred to an algorithm where you also compare from

10:51.320 --> 10:52.520
left to right.

10:52.600 --> 10:56.820
So you can actually extend those algorithms, but I cannot show you all

10:56.820 --> 10:58.360
the different pattern matching algorithms.

10:58.520 --> 11:03.320
I just wanted to show you how this idea can be utilized that leads to

11:03.320 --> 11:07.680
new insights on how we can actually exploit information that we have

11:07.680 --> 11:11.500
looked at in a pattern in a way such that we can reduce the shift

11:11.500 --> 11:16.040
distance or reduce the number of comparisons of symbols significantly.

11:16.700 --> 11:23.980
So here in Knuth-Morris-Pratt, we are looking for the next symbol of

11:23.980 --> 11:27.840
the pattern that has to be compared at this current position, whereas

11:27.840 --> 11:33.500
here for Boyer-Moore, we define the shift distance of the pattern and

11:33.500 --> 11:37.880
we always restart at the rightmost position of the pattern, and then

11:37.880 --> 11:41.800
go back to the left with the comparisons.

11:43.620 --> 11:49.000
Here, the next function depends only on the pattern.

11:49.760 --> 11:53.980
We only have to analyze the pattern and then define that function.

11:54.500 --> 12:01.920
For Boyer-Moore, if we look at this shift distance function, sigma of

12:01.920 --> 12:06.160
j and c, where j is the current position in the pattern and c is the

12:06.160 --> 12:06.500
symbol.

12:07.160 --> 12:12.540
So here we have a dependence on these two arguments, not just

12:12.540 --> 12:16.080
the pattern, but also the different symbols that could occur in the

12:16.080 --> 12:16.420
document.

12:17.700 --> 12:19.660
Runtime definitely is very different.

12:19.920 --> 12:21.960
Always linear time for Knuth-Morris-Pratt.

12:22.460 --> 12:26.520
For Boyer-Moore, you could say quadratic, the product of document size and pattern size.

12:28.460 --> 12:31.900
But the important point is that on the average, you have sublinear

12:31.900 --> 12:34.240
time, which means very fast pattern matching.

12:35.000 --> 12:39.220
And if you look at good cases and worst cases, the

12:39.220 --> 12:45.700
good case here is like this one for Knuth-Morris-Pratt.

12:45.840 --> 12:49.440
If you have something like this, you can usually shift quite far

12:49.440 --> 12:54.460
because you don't have these prefixes that reoccur.

12:54.720 --> 13:02.760
In this case, here in Boyer-Moore, if you have only A's in there and

13:02.760 --> 13:06.220
you have a text which contains only very few A's, definitely you will

13:06.220 --> 13:12.860
always get quite large shifts because if there are only few A's and

13:12.860 --> 13:16.740
other symbols, then these other symbols will always lead to a large

13:16.740 --> 13:17.760
shift of the pattern.

13:19.740 --> 13:23.860
Knuth-Morris-Pratt is bad for exactly those patterns because here

13:23.860 --> 13:28.500
you have prefixes which will definitely reoccur.

13:29.260 --> 13:31.120
And so here you have

13:34.440 --> 13:37.840
a worst or a bad case, which is good for the other one.

13:38.420 --> 13:42.860
And take a pattern like this one here, where you have

13:42.860 --> 13:47.860
identical symbols and then only at the first position a difference and

13:47.860 --> 13:51.440
you compare it in this document where you have only A's, then you

13:51.440 --> 13:55.140
would have to scan from right to left every time at every position.

13:55.260 --> 13:59.620
You can only shift by one symbol due to those heuristics and that

13:59.620 --> 14:04.060
means you would have the worst case processing time for that type of

14:04.060 --> 14:04.300
pattern.

14:05.180 --> 14:11.240
Okay, this was pattern matching and pattern matching is something

14:11.240 --> 14:16.480
which has to do with information retrieval, but the point is that here

14:16.480 --> 14:24.520
you would have to process... oops, I switched it to red.

14:25.800 --> 14:26.600
This is interesting.

14:26.820 --> 14:30.220
I would just make a comment on that after I just watch it and see what

14:30.220 --> 14:30.540
happens.

14:31.140 --> 14:36.300
Okay, so the point that we had here was that we analyzed the pattern

14:36.300 --> 14:39.340
to search arbitrary documents as fast as possible.

14:39.580 --> 14:44.620
That means at the moment where we have a pattern, we analyze it or

14:44.620 --> 14:51.200
construct essentially the algorithm, or the device, for pattern

14:51.200 --> 14:54.780
matching and then can search arbitrary documents very fast.

14:56.520 --> 15:01.020
That means, at the time we ask a query, we would have to

15:01.020 --> 15:05.220
actually compute something and then search all the documents.

15:05.800 --> 15:09.780
So the different approach that one should look at or could look at is

15:09.780 --> 15:14.240
to analyze the document to allow for an efficient search for arbitrary

15:14.240 --> 15:14.680
patterns.

15:15.000 --> 15:19.020
And so there you do some pre-processing, not of all the queries, but

15:19.020 --> 15:22.980
of all the documents to allow for an efficient search afterwards.
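Preprocessing the documents instead of the pattern leads to an inverted index. A minimal sketch of the idea, assuming a plain dictionary-of-sets layout (my choice, not a prescribed data structure):

```python
def build_index(documents):
    """Preprocess the documents once: map every word to the set of
    document ids containing it, so that answering a query becomes a
    dictionary lookup instead of a scan over all documents."""
    index = {}
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

docs = ["pattern matching in documents",
        "dynamic programming for edit distance",
        "matching words in an index"]
index = build_index(docs)
print(index["matching"])  # {0, 2}
```

The expensive work happens once, at indexing time; afterwards arbitrary word queries are answered without touching the documents again, which is exactly the trade-off described here.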

15:23.760 --> 15:26.500
And I have to tell you now how this actually has to be done.

15:27.240 --> 15:30.460
Now we have to look at how we can actually search a document, how we

15:30.460 --> 15:33.640
can analyze a document, what kinds of things do we actually need.

15:34.320 --> 15:41.140
And so we have to first of all divide a document into words.

15:41.660 --> 15:44.040
Later on we would like to search for words.

15:45.200 --> 15:48.100
And so we have to know what a word actually is.

15:48.600 --> 15:50.640
What are the words contained in a document?

15:51.120 --> 15:54.520
Then, if we have a database for that, we can just look in

15:54.520 --> 15:57.520
that database in that index and search for the word.

15:57.600 --> 16:00.820
But for that we need to know what is a word in a document.

16:01.520 --> 16:05.080
If we look at this slide, you certainly see immediately this is a

16:05.080 --> 16:07.860
word, that is a word, that is one, that is one, that is one.

16:08.360 --> 16:18.380
But for example, "words" is the word, and the dot there, the full stop,

16:18.500 --> 16:20.320
is not part of that word.

16:20.740 --> 16:24.320
Or here the comma is not part of "compiling".

16:25.420 --> 16:29.780
We see that immediately, but we would have to tell the algorithm which

16:29.780 --> 16:36.980
is separating our document into words that those symbols don't count

16:36.980 --> 16:37.580
as parts of words.

16:37.760 --> 16:42.140
Or for example here, the sequence opening parenthesis and corresponds

16:43.220 --> 16:46.760
contains only the word "corresponds" and not the opening bracket.

16:48.300 --> 16:52.780
And so we have to define reasonable ways of actually determining what

16:52.780 --> 16:53.440
a word is.

16:53.840 --> 16:57.500
That means we have to cancel out certain extra symbols.

16:58.400 --> 17:00.500
So how could we separate a document into words?

17:00.640 --> 17:07.020
We would just say certain special characters should be taken out here,

17:07.400 --> 17:08.960
colon, blank, or hyphen.

17:11.440 --> 17:15.460
Just to make the remark that I said I would make:

17:16.800 --> 17:20.120
that all of a sudden it switched back from red to black.

17:20.700 --> 17:24.980
Yesterday I used this computer, this laptop here the first time, it's

17:24.980 --> 17:32.040
a new one, and I used that with Office 2010, PowerPoint 2010.

17:32.700 --> 17:40.180
And I noticed that in PowerPoint 2010, all of a sudden, first of all,

17:40.340 --> 17:46.280
if I look here at the different pens I can take, the Filzstift (felt-tip pen) is no

17:46.280 --> 17:46.860
longer there.

17:47.020 --> 17:47.580
It disappeared.

17:48.580 --> 17:50.440
So I only have a Stift, a plain pen.

17:51.220 --> 17:56.720
I cannot have this special type of pen, which provides the same

17:56.720 --> 17:59.320
properties as a Filzstift, different from a plain pen.

18:00.280 --> 18:00.980
It disappeared.

18:01.440 --> 18:06.120
Another thing that disappeared is the possibility, so I still have the

18:06.120 --> 18:13.560
Stift now here, the possibility to have a permanent switch to the

18:13.560 --> 18:13.760
pen.

18:14.380 --> 18:21.080
After every animation click, this setting switched back to the cursor.

18:22.540 --> 18:26.060
And I had to re-select the pen after every animation.

18:27.500 --> 18:31.680
This is a bug in Windows, or in Office 2010.

18:32.420 --> 18:40.140
Obviously nobody who designed that ever used a tablet or used a pen to

18:40.140 --> 18:41.740
annotate a presentation.

18:42.780 --> 18:45.040
Microsoft designed the Tablet PC software.

18:46.180 --> 18:50.300
They came up as the first to design a perfect software for that, and

18:50.300 --> 18:54.580
they forgot obviously that somebody would actually use that, and would

18:54.580 --> 18:59.800
use that in a way such that you would actually like to have a pen even

18:59.800 --> 19:03.320
after you made an animation, or made a click there.

19:03.560 --> 19:04.460
It's ridiculous.

19:05.660 --> 19:07.960
Now there have been long discussions, we noticed that.

19:08.060 --> 19:11.720
There have been long discussions in certain forums on that, and

19:11.720 --> 19:16.160
somebody has come up with a solution so you can download from a

19:16.160 --> 19:20.300
different site, you can download some patch which actually switches

19:20.300 --> 19:23.420
back to the pen after every animation.

19:24.280 --> 19:28.960
So there's a special program running now on my laptop which watches

19:28.960 --> 19:33.980
the PowerPoint and just interferes with it to switch back to the pen

19:33.980 --> 19:36.120
every time I make an animation click.

19:36.480 --> 19:38.600
It's ridiculous, but things like that happen.

19:40.500 --> 19:45.580
And another thing is that the recording software, I also got a new

19:45.580 --> 19:49.960
version of the recording software, and all of a sudden I can no longer

19:49.960 --> 19:55.960
just with one click store it as an AVI file.

19:56.840 --> 20:02.920
I can only store it as a Camtasia project file and use Camtasia, this

20:02.920 --> 20:08.080
Studio software, to then process it and produce a file in

20:08.080 --> 20:09.820
many different formats.

20:10.380 --> 20:14.160
That means in some way they want to prevent that somebody would just

20:14.160 --> 20:18.860
record with Camtasia, produce AVI, and use a different tool for

20:18.860 --> 20:19.680
further processing.

20:20.220 --> 20:24.080
They want to make sure that you use the studio software of Camtasia

20:24.080 --> 20:31.940
for further processing, and then it takes about half an hour just

20:31.940 --> 20:36.320
for rendering the document to produce the AVI file that you'd like to

20:36.320 --> 20:38.400
get, or an MP4 file, or whatever.

20:40.060 --> 20:44.680
This is a new version of the software, you pay extra for that, and all

20:44.680 --> 20:48.920
of a sudden you notice the functionality is reduced in order to make

20:48.920 --> 20:53.220
sure that you are really only using that software.

20:53.720 --> 20:56.240
This is such a bad policy.

20:57.840 --> 21:01.460
One really should say, I don't want that software anymore, but I need

21:01.460 --> 21:01.840
it.

21:02.600 --> 21:06.860
But this is just annoying, and I don't know who is designing a

21:06.860 --> 21:07.640
strategy like that.

21:08.340 --> 21:09.560
This is ridiculous, really.

21:10.160 --> 21:15.780
Okay, since this recording is available everywhere in the world, I'm happy

21:15.780 --> 21:18.340
to make these comments on bad software development.

21:20.260 --> 21:25.480
Okay, so let's come back to how we can actually separate a document

21:25.480 --> 21:26.600
into words.

21:26.940 --> 21:29.000
So here you see the problems that we might run into.

21:29.080 --> 21:34.840
The first idea certainly is that we know, okay, a colon is never part

21:34.840 --> 21:36.580
of a word.

21:36.760 --> 21:38.300
A blank is not part of a word.

21:38.400 --> 21:39.700
A hyphen, definitely not.

21:40.360 --> 21:43.660
But then look at something like this, Jean-Claude.

21:44.900 --> 21:49.060
There we have a hyphen in there, and it should be, is that one word?

21:50.880 --> 21:52.600
Should be in some sense one word.

21:52.760 --> 21:53.960
Or state of the art.

21:55.060 --> 21:59.920
Here it would not be reasonable to separate that into the separate

21:59.920 --> 22:00.100
words state, art, of, and the.

22:00.220 --> 22:02.540
Here you would need state of the art as one word.

22:03.780 --> 22:04.920
Or MS-DOS.

22:05.140 --> 22:08.800
Nobody may look for MS-DOS anymore, but maybe you have words like that.

22:09.260 --> 22:11.680
And then this would be the word you're actually looking for.

22:12.120 --> 22:16.400
So since you can conceive that somebody would look for that word, you

22:16.400 --> 22:19.740
would need some way of actually detecting that that is a word.

22:20.800 --> 22:23.100
So you have to do something about that.

22:23.240 --> 22:24.420
Or something like this.

22:24.560 --> 22:31.800
If you separate a word, yeah, here, Arbeitsrecht, then you insert a

22:31.800 --> 22:32.460
hyphen in there.

22:34.120 --> 22:35.580
That's not part of the word.

22:35.680 --> 22:38.820
And you should not, in this case, store it as part of

22:38.820 --> 22:42.880
the word; we should ignore it and just delete it, simply

22:42.880 --> 22:45.800
because we had to split the word over two lines.

22:46.520 --> 22:47.940
Or something like this.

22:48.420 --> 22:49.680
Chapter 2.3.

22:51.480 --> 22:53.520
Is this just the word chapter?

22:54.100 --> 22:57.440
And then the word two and the word three?

22:57.860 --> 23:04.900
Because we said we would not consider these punctuation symbols as

23:04.900 --> 23:05.840
relevant symbols.

23:06.280 --> 23:10.060
So here, I just wanted to show you that you have to look carefully

23:10.060 --> 23:13.800
what you are doing when you are separating a document into words.

23:14.680 --> 23:19.960
And so obviously something has to be done there.

23:21.220 --> 23:26.380
And then you could say you only look at strings without digits.

23:26.580 --> 23:30.880
Certainly, we already had that here: chapter 2.3 has digits in there.

23:31.720 --> 23:32.640
BS 2000.

23:32.880 --> 23:35.120
Who knows what BS 2000 is?

23:36.880 --> 23:39.400
None of you knows what BS 2000 is.

23:39.440 --> 23:42.440
No, I don't see all those who will listen to this lecture.

23:42.920 --> 23:45.660
But those that are here in class don't know what it is.

23:46.120 --> 23:49.320
It is one of the most famous German operating systems.

23:49.560 --> 23:50.680
Betriebssystem 2000.

23:51.520 --> 23:57.840
Siemens, our major German manufacturer of mainframe computers for

23:57.840 --> 24:04.520
enterprises, had designed BS 2000, an operating system very similar to

24:04.520 --> 24:09.180
the operating system of the IBM 360, famous operating system.

24:09.840 --> 24:14.200
And most of the enterprises in Germany have been running BS 2000 for

24:14.200 --> 24:15.000
decades.

24:15.960 --> 24:18.160
And so it's a very important operating system.

24:18.820 --> 24:21.100
F16, I guess you know what F16 is.

24:21.500 --> 24:23.100
H2O, chemical formulas.

24:23.260 --> 24:24.320
You have digits in there.

24:24.820 --> 24:28.720
Somehow you have to store these digits in reasonable ways, or you have

24:28.720 --> 24:33.020
to notice that this is a reasonable word to look at.

24:34.180 --> 24:39.600
So you could say you don't take digits as first characters.

24:39.700 --> 24:45.160
I just wanted to outline you have to be very careful if you actually

24:45.160 --> 24:46.980
separate a document into words.

24:48.060 --> 24:51.740
Now another point is capital or small letters.

24:52.180 --> 24:58.300
If you look at a document, would you actually distinguish words

24:58.300 --> 25:03.080
depending on whether the first character is a capital letter or a

25:03.080 --> 25:03.580
small letter?

25:04.340 --> 25:05.460
Sometimes you would.

25:06.000 --> 25:13.360
If you would make a search where you say it should be case sensitive,

25:14.440 --> 25:18.680
then you explicitly only want to look for words actually having this

25:18.680 --> 25:19.880
capital letter in there.

25:20.260 --> 25:24.220
But usually you would say it's not case sensitive and then you would

25:24.220 --> 25:25.620
just look for that.

25:26.000 --> 25:30.460
But depending on what you did, you certainly would have to separate

25:30.460 --> 25:36.180
that in the way you actually produce your index for that document.

25:37.860 --> 25:49.480
So essentially you would analyze your document, and let me just remark, it's

25:49.480 --> 25:51.280
not that difficult to actually do that.

25:51.360 --> 25:56.440
If you just use certain separating symbols, you can do that using the

25:56.440 --> 26:01.980
StringTokenizer class of Java to just specify the separators and

26:01.980 --> 26:04.720
then you get very easily the separation into words.
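The same effect as Java's StringTokenizer can be had in Python with a regular expression over a chosen delimiter set; the separator set below is illustrative, not a recommendation:

```python
import re

def tokenize(text, separators=" \t\n.,;:()"):
    """Split a document into words at the given separator characters,
    analogous to Java's StringTokenizer with an explicit delimiter set."""
    pattern = "[" + re.escape(separators) + "]+"
    return [t for t in re.split(pattern, text) if t]

print(tokenize("state-of-the-art, MS-DOS; words."))
# ['state-of-the-art', 'MS-DOS', 'words']
```

Note how the choice of separators encodes the earlier decisions: keeping the hyphen out of the set keeps "MS-DOS" and "Jean-Claude" together, while having '.' in the set splits "Chapter 2.3" into chapter, 2 and 3, which is exactly the problem discussed above.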

26:05.580 --> 26:10.420
So it is something similar to a lexical analysis, like scanning

26:10.420 --> 26:16.800
through a document looking at special characters, but this is a

26:16.800 --> 26:19.600
function which is available in Java and also in other programming

26:19.600 --> 26:20.040
languages.

26:21.340 --> 26:25.500
So that is one step which is done quite easily.

26:25.940 --> 26:29.400
Not very interesting algorithms, but you just have to think a bit what

26:29.400 --> 26:33.320
is the important information you have to use for separating the

26:33.320 --> 26:33.680
document.

26:34.260 --> 26:39.060
Another point is that you have to look for relevant words.

26:39.300 --> 26:42.560
You are never interested in all the words that occur in a document.

26:42.780 --> 26:46.800
For example, you see it switched back to black.

26:48.280 --> 26:51.560
You have to notice at what time it switches to black.

26:51.960 --> 26:53.680
Maybe when I start a new page.

26:53.820 --> 26:54.940
We have to watch that.

26:56.060 --> 26:56.940
This is interesting.

26:59.320 --> 27:02.860
You always introduce interesting features so that you get surprised

27:02.860 --> 27:04.080
when you use a new software.

27:05.420 --> 27:12.880
So the answer to what is relevant depends certainly on what you say is

27:12.880 --> 27:13.240
relevant.

27:13.700 --> 27:15.680
So "of" or "if" is not relevant.

27:15.920 --> 27:16.760
"On" is not relevant.

27:17.920 --> 27:18.920
"All" is not relevant.

27:20.380 --> 27:24.680
You could cut out those words, but you need a definition for that if

27:24.680 --> 27:27.400
you want to scan a document and say that is irrelevant.

27:30.940 --> 27:35.920
And so you can do that on the fly while you

27:35.920 --> 27:39.780
are analyzing the document; it can be integrated into the lexical

27:39.780 --> 27:40.220
analysis.

27:41.160 --> 27:44.220
So here is an example, and I don't claim that this is complete.

27:44.800 --> 27:45.860
What is this?

27:50.360 --> 27:51.480
Another surprise.

27:52.120 --> 27:53.400
And you see it's black again.

27:55.160 --> 27:56.220
This is actually funny.

27:58.560 --> 28:03.640
So I assume now that this happens every time I go to the

28:03.640 --> 28:04.460
next slide.

28:05.460 --> 28:06.840
Maybe that's the case.

28:07.160 --> 28:07.980
So now it's red.

28:08.700 --> 28:13.260
So I would like to disregard those words here.

28:13.440 --> 28:16.500
This is just a small example of how you could do that.

28:16.600 --> 28:19.840
You could design something similar to a finite automaton, transform

28:19.840 --> 28:24.160
that into software where you would just say, OK, as soon as I discover

28:24.160 --> 28:29.500
something which is just an "a", it would

28:29.500 --> 28:30.280
be cut out.

28:31.420 --> 28:34.320
Then if you have "an", it would be cut out.

28:34.440 --> 28:36.340
If you have "and", it would be cut out.

28:36.840 --> 28:38.080
If you would have an "i"

28:39.720 --> 28:46.880
here, just "I" would not be cut out, but "in" would be cut out, "or" would be

28:46.880 --> 28:47.480
cut out.

28:47.760 --> 28:53.660
And in this case, like, "in" would also be cut out.

28:54.980 --> 28:59.380
OK, this is just showing a simple way of specifying that those words

28:59.380 --> 29:03.700
are those that you would like to filter out.

29:04.340 --> 29:09.820
And that means you just look for these words in some document and then

29:09.820 --> 29:10.560
cut them out.

29:10.780 --> 29:15.900
It is essentially the same as what we did before, when I

29:15.900 --> 29:19.980
said we are looking for pattern matching in documents for certain

29:19.980 --> 29:21.460
words that occur there.

29:21.660 --> 29:25.620
If we look for several words for several patterns at the same time,

29:25.820 --> 29:29.120
this is some kind of automaton that would look for several words at

29:29.120 --> 29:29.840
the same time.

29:30.420 --> 29:33.740
And every time it is in the final state, it would say you can cut out

29:33.740 --> 29:34.220
that word.

29:35.180 --> 29:37.840
OK, filtering is not that difficult.
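[Editor's note: the stop-word filtering described above could be sketched as follows; the stop-word list here is only illustrative, and a real system would define one per language, as the lecture notes.]

```python
# Illustrative stop-word list; a real system would define one per language.
STOP_WORDS = {"a", "an", "and", "in", "of", "if", "on", "or", "all"}

def filter_tokens(text):
    # Scan the document token by token and cut out the stop words,
    # as the automaton sketched in the lecture would do in its final states.
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(filter_tokens("All of the examples in a document"))
```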

29:38.700 --> 29:40.240
Stemming, another operation.

29:40.940 --> 29:43.620
I have to mention all these possible operations that could

29:43.620 --> 29:44.020
occur.

29:44.720 --> 29:49.980
So stemming is something where you try to generalize your search.

29:51.120 --> 29:52.480
Why do you need that?

29:52.600 --> 30:00.180
So if, for example, you would look at the word examples, then it does

30:00.180 --> 30:02.860
make sense not only to look at examples.

30:03.380 --> 30:04.320
It's black again.

30:07.220 --> 30:10.540
But you could also look at example.

30:11.560 --> 30:16.980
Or if you look for replacement, it would be reasonable to just look

30:16.980 --> 30:18.160
for replace.

30:20.140 --> 30:23.920
And then replacement would be one of those occurrences.

30:24.580 --> 30:28.320
And so in this way, you would generalize the search because you would

30:28.320 --> 30:33.520
look for all the different words that relate to the same word stem.

30:34.560 --> 30:37.740
And I'll give you several examples for that.

30:38.620 --> 30:42.500
So if you have a suffix of a word, and this is just something which

30:42.500 --> 30:45.100
has to be defined for the different languages.

30:45.300 --> 30:53.600
So for English language, a suffix IES would be replaced just by an I.

30:53.740 --> 30:57.320
So if you have a word ponies, you would say this reduces to pony.

30:57.820 --> 31:01.240
If you have an S in the end, you could just cut out that S.

31:02.260 --> 31:06.460
So this could also have been done here at "ponies" with IES.

31:07.520 --> 31:10.880
You could have cut out just the S, but since you have here the IES,

31:11.540 --> 31:13.460
you would just replace that with an I.

31:14.400 --> 31:19.460
If you have an ED at the end, you could, here for example, in

31:19.460 --> 31:23.280
such a word just erase the ED at the end and get "plaster".

31:24.420 --> 31:25.940
Here's a list of more of that.

31:26.220 --> 31:33.800
If you have here all kinds of suffixes of a word, then you would just

31:33.800 --> 31:36.360
modify that.

31:36.460 --> 31:40.440
So for example here, an AT at the end.

31:40.560 --> 31:46.620
For example, conflated would be reduced using that rule, this rule

31:46.620 --> 31:49.100
here, to conflat.

31:49.660 --> 31:54.360
And then actually it would be extended again to conflate.
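[Editor's note: a minimal sketch of such suffix-stripping rules, in the style of the Porter stemmer; the rule list below is a tiny illustrative subset, not the full algorithm. Note that the bare IES rule turns "ponies" into the stem "poni" rather than "pony"; longer suffixes must be tried before shorter ones.]

```python
# Illustrative subset of suffix-stripping rules (order matters:
# the longer suffix "ies" must be tried before the bare "s").
RULES = [("ies", "i"), ("ment", ""), ("ed", ""), ("s", "")]

def stem(word):
    # Apply the first matching suffix rule, as in the examples above.
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(stem("ponies"), stem("plastered"), stem("replacement"), stem("examples"))
```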

31:55.480 --> 31:57.800
Okay, rules like that make sense for English.

31:58.580 --> 32:05.820
So you see here all kinds of different endings that actually would be

32:05.820 --> 32:11.540
replaced in a reasonable way to, for example, fullness, to full, or

32:11.540 --> 32:17.520
sensibility, for example, here this word would reduce to sensible.

32:18.340 --> 32:20.620
Things like that would be done here.

32:20.980 --> 32:24.140
So here, or the Y is replaced by an I.

32:24.300 --> 32:29.400
Then you have here, for example, "happy", because for "happiness" they would

32:29.400 --> 32:31.620
just add the "ness" in the end.

32:32.300 --> 32:39.200
Or you would have the ness as something which extends the stem happy.

32:39.660 --> 32:43.760
In this way, here on the right hand side, here you always have the

32:43.760 --> 32:48.440
word stems, and obviously in this way you would generalize your

32:48.440 --> 32:53.960
search, because if you would look for motoring, for example, here, if

32:53.960 --> 32:57.140
you would look for motoring, you would also get hits where you only

32:57.140 --> 32:58.420
have the word motor in there.

32:58.420 --> 33:03.640
Maybe it's reasonable, yeah, it will lead to a larger set of

33:03.640 --> 33:10.860
documents, so maybe it's not as precise, but maybe it extends the

33:10.860 --> 33:14.260
recall, because you would get more documents that are actually

33:14.260 --> 33:15.980
relevant to your search.

33:16.960 --> 33:22.040
So advantages are that you have to memorize in your index fewer words,

33:22.040 --> 33:27.240
because many possible words reduce to the same stem; it can reduce the

33:22.040 --> 33:27.240
search effort significantly, and you also have this automatic

33:32.280 --> 33:36.280
extension of your searches to related words, so you have a generalized

33:36.700 --> 33:37.040
search.

33:38.780 --> 33:43.260
Now, the next question is how we can actually build an efficient

33:43.260 --> 33:44.940
search structure.

33:45.960 --> 33:47.920
You can do that in many different ways.

33:48.020 --> 33:53.780
You can just build such an index and then have some list in some

33:53.780 --> 33:54.140
way.

33:55.260 --> 33:58.360
So what you would like to do is you would like to specify...

34:00.180 --> 34:01.320
what's happening here?

34:08.650 --> 34:09.710
Aha, now it comes.

34:10.610 --> 34:13.110
It's doing something in the background, which is...

34:13.110 --> 34:19.470
maybe I overused it in the moment, but let's hope that it's recording

34:19.470 --> 34:19.910
correctly.

34:20.710 --> 34:22.730
So we have here...

34:22.730 --> 34:27.170
now I don't even have the... there's the pen back again, but in black.

34:28.950 --> 34:31.550
So I think I just leave it at black.

34:32.110 --> 34:36.050
It's not that dangerous to have black annotation, but it's not as

34:36.050 --> 34:37.830
visible as the red annotation.

34:38.850 --> 34:41.090
Okay, so the inverted file would do the following.

34:41.430 --> 34:46.910
It would specify for every word of the document where it occurs.

34:48.170 --> 34:54.350
It would indicate the relevance, some weight indicating how relevant

34:54.350 --> 34:56.510
this word is for the document.

34:57.150 --> 35:02.970
It could have an indication of the frequency at which it occurs in the

35:02.970 --> 35:07.970
document, which might be related also to the relevance, but those

35:07.970 --> 35:12.670
might be different, might be unrelated indicators.

35:13.350 --> 35:18.970
So you would have a list of words where you have these links to the

35:18.970 --> 35:21.990
locations where it occurs and the relevance.

35:23.310 --> 35:25.570
And then the question is, what do you do with that list?

35:25.690 --> 35:28.250
Like if you have a book, you would have in the end, you would have

35:28.250 --> 35:29.270
just a long list.

35:29.390 --> 35:33.350
You would have all the entries there, and then you would have a list

35:33.350 --> 35:38.670
of those indicators telling you something about those words where they

35:38.670 --> 35:40.290
actually occur in the document.

35:41.290 --> 35:47.450
So that information would be listed for every word in the index.
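[Editor's note: a minimal sketch of such an inverted file; each word maps to postings recording in which document it occurs and how often. A relevance weight, as mentioned above, could be stored alongside the frequency; the documents here are made up for illustration.]

```python
from collections import defaultdict

def build_inverted_index(docs):
    # word -> {document id: term frequency}
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

docs = {1: "the b tree index", 2: "tree search in a tree"}
index = build_inverted_index(docs)
print(index["tree"])
```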

35:48.610 --> 35:50.290
The question is how you do that.

35:50.370 --> 35:53.090
You could have a sorted list with random access where you could do

35:53.090 --> 35:54.550
binary search on the index.
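[Editor's note: a sketch of that sorted-list approach with a made-up vocabulary; binary search works, but every insertion must shift elements to keep the order, which is the administrative overhead discussed next.]

```python
import bisect

# A sorted vocabulary with random access allows binary search.
vocab = ["apple", "motor", "pony", "tree"]

def contains(word):
    i = bisect.bisect_left(vocab, word)  # binary search
    return i < len(vocab) and vocab[i] == word

bisect.insort(vocab, "index")  # O(n) shifting just to keep the list sorted
print(contains("index"), contains("zebra"))
```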

35:56.190 --> 36:02.430
That would be okay, but it has to be administered: if you have

36:02.430 --> 36:07.350
frequent changes of that list, frequent additions, frequent new

36:08.150 --> 36:12.490
information on which documents actually, on which document does this

36:12.490 --> 36:17.610
word occur, then you would have quite an administrative overhead,

36:17.930 --> 36:20.230
because you would have to re-sort the list all the time.

36:20.950 --> 36:25.390
And this is not that reasonable, so you would have to do something

36:25.390 --> 36:29.370
about it and come up with better ideas.

36:29.550 --> 36:33.170
And we know that for searching, we have tree structures usually, and

36:33.170 --> 36:37.050
there are special tree structures that are adequate for those index

36:37.050 --> 36:38.270
systems.

36:39.190 --> 36:46.110
Okay, so let's look at possible tree structures next.

36:46.490 --> 36:50.630
Oh, by the way, I mention here two specific tree

36:50.630 --> 36:56.550
structures, so-called B-trees, which are in particular very adequate

36:57.010 --> 37:02.830
for very, very large systems where you have long lists of entries,

37:03.890 --> 37:08.250
large collections of information.

37:08.770 --> 37:13.710
B-trees allow for efficient access to very large databases because

37:13.710 --> 37:19.810
the number of accesses to hard disk is reduced.

37:20.630 --> 37:29.330
And then there's a specific type of tree, the so-called tries. The name

37:29.330 --> 37:33.530
actually comes from retrieval, so you could say it's

37:33.530 --> 37:41.650
a tree, but in this case it is spelled as t-r-i-e to

37:41.650 --> 37:45.110
indicate that this has something to do with text or information

37:45.110 --> 37:45.870
retrieval.

37:46.610 --> 37:52.330
And I will show you in particular the so-called Patricia tries, or you

37:52.330 --> 37:53.970
could pronounce it "try" or "tree".

37:54.310 --> 37:58.670
I will stick to "tree" because it comes from retrieval.

37:59.350 --> 38:06.670
And here, the specific Patricia trees are designed looking at

38:06.670 --> 38:12.810
properties of longer documents, of text documents, and separation of

38:12.810 --> 38:16.870
those text documents in a reasonable way such that you can actually

38:16.870 --> 38:20.170
find the relevant information in a very efficient way.

38:21.750 --> 38:26.390
I would like to start briefly by showing you the B-tree concept, maybe

38:26.390 --> 38:30.910
you have seen that already, but I just want to show you on this one

38:30.910 --> 38:35.150
slide on that topic, how we can, or what kind of structure that

38:35.150 --> 38:35.850
actually is.

38:36.110 --> 38:44.670
The name derives from Rudolf Bayer, who is a scientist at Munich, or

38:44.670 --> 38:46.810
has been an active scientist there.

38:48.370 --> 38:49.730
He's long in retirement.

38:50.450 --> 38:51.770
B-trees, Bayer trees.

38:52.730 --> 38:58.330
And B-trees have been designed for, as I said, large data sets in

38:58.330 --> 39:02.790
order to reduce the number of accesses to hard disk.

39:03.770 --> 39:08.830
And the idea is to have, like, if we have a tree, if I just indicate

39:08.830 --> 39:13.410
here some kind of tree structure, normally, you would say there's a

39:13.410 --> 39:20.450
specific piece of information in that, in every node, maybe one symbol

39:20.450 --> 39:20.870
or so.

39:21.010 --> 39:24.730
And in the case of B-trees, we have very much information in the node.

39:24.870 --> 39:29.270
Every node contains essentially as much information as you can

39:29.270 --> 39:32.490
retrieve with one access to hard disk.

39:32.590 --> 39:34.350
So one block of data, for example.

39:34.670 --> 39:38.010
Usually, you retrieve from secondary storage, you retrieve one block

39:38.010 --> 39:41.790
of data at the same time, just as one unit.

39:42.550 --> 39:49.550
And so here, the information that is put into these nodes of a tree,

39:49.990 --> 39:54.010
corresponds to the information in one block of data.

39:55.130 --> 39:59.370
So you have to specify this block of data that should be in such a

39:59.370 --> 39:59.630
node.

40:00.190 --> 40:05.290
So, and you should also make sure that you have some kind of balance

40:05.290 --> 40:11.010
structure in order to make sure that the number or the access time is

40:11.010 --> 40:15.910
more or less independent of the specific location of data.

40:16.710 --> 40:20.670
And for that, we have a balanced tree.

40:20.890 --> 40:25.270
So this is always balanced, that means the number of links from root

40:25.270 --> 40:34.030
to leaves of that tree is the same.

40:34.270 --> 40:39.690
So all the paths here in that tree are the same, you have for

40:39.690 --> 40:42.050
all leaves the same depth.

40:42.970 --> 40:46.970
But the sizes of the nodes that you visit on the path from the root to

40:46.970 --> 40:48.990
the leaf may be different.

40:49.770 --> 40:56.030
So every inner node here, except for the root, has at least M over 2

40:56.030 --> 41:00.330
successors, where M is the degree of that B tree.

41:01.130 --> 41:05.150
The root has at least two successors, all the others will have at

41:05.150 --> 41:07.090
least M over 2.

41:07.830 --> 41:12.170
And there are at most M successors to every node, so you may have more

41:12.170 --> 41:16.390
than, definitely more than two, but not more than M.

41:17.090 --> 41:22.650
And if you have a node with K successors, you have K minus one keys in

41:22.650 --> 41:22.890
there.

41:23.050 --> 41:34.450
That means, if you have some node having, for example, keys K1 and so

41:34.450 --> 41:39.790
on, K... no, it should now have KK.

41:40.510 --> 41:41.470
That doesn't make sense.

41:42.330 --> 41:53.270
So, if I have, take the German abbreviation, S1 to SK minus one, then

41:53.270 --> 42:03.210
I have here every key will be used as a separator, so you would just,

42:03.210 --> 42:08.970
if you enter such a node, and you look for a specific entry, you would

42:08.970 --> 42:17.010
just search for the first key that is

42:17.010 --> 42:25.930
smaller than... well, okay, the first key that is larger, you could say,

42:26.090 --> 42:30.030
the first key that is larger than the information you are looking at,

42:30.030 --> 42:32.090
or you are looking for.

42:32.610 --> 42:37.610
So, if it's smaller than S1, you would go to this link here, the first

42:37.610 --> 42:37.890
link.

42:37.970 --> 42:42.090
If it's larger than S1, but less than S2, you take the second one, and

42:42.090 --> 42:42.470
so on.

42:42.850 --> 42:50.710
And if you have K minus one keys in here, you have exactly one, two,

42:51.010 --> 42:55.530
to K different links going out of that.

42:55.690 --> 42:56.410
Very simple thing.
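[Editor's note: the key-separator search described above can be sketched like this; a node is a pair (keys, children), leaves have no children, and the tree below mirrors the small example built later in the lecture. This is only an illustrative descent, not a full B-tree implementation.]

```python
import bisect

def search(node, target):
    # In each node, find the first key not smaller than the target
    # and follow the corresponding child link.
    keys, children = node
    i = bisect.bisect_left(keys, target)
    if i < len(keys) and keys[i] == target:
        return True
    if children is None:  # reached a leaf without finding the key
        return False
    return search(children[i], target)

leaf = lambda *ks: (list(ks), None)
root = ([15, 25], [leaf(5, 10), leaf(17), leaf(30, 35)])
print(search(root, 17), search(root, 20))
```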

42:56.790 --> 42:57.890
Spent too much time on that.

42:58.290 --> 43:03.610
So, if you have an example for M equals three, very simple size, very

43:03.610 --> 43:10.050
small size, normally for a B-tree, you would have something like 30,

43:10.230 --> 43:15.010
50, 100, or whatever, depending on the size of the individual keys.

43:15.530 --> 43:21.890
So, the number of bytes usually should be something like 1K or maybe

43:21.890 --> 43:24.350
2K bytes in there.

43:25.090 --> 43:27.290
So, it would be a larger structure.

43:27.750 --> 43:31.710
Now, if you have such a structure, how would that look like?

43:31.790 --> 43:35.390
Like here, for example, we could have something like...

43:39.510 --> 43:53.350
For example, if we would have here 15, 25, for example, 35, and here,

43:53.610 --> 43:54.630
let's say, maybe

43:57.950 --> 44:06.670
5, 10, and 13, for example, and here we take something like smaller

44:06.670 --> 44:16.150
than 25, so it should be, let's say, 10 and... no, sorry, not 10, come

44:16.150 --> 44:19.250
on, larger than 15.

44:19.770 --> 44:31.990
So, for example, here we would take 17 and 30, and here, for example,

44:35.090 --> 44:42.390
32, for example, no, it doesn't make sense.

44:43.110 --> 44:44.490
It's 35 up there.

44:45.650 --> 44:53.170
So, I should take here, for example, all larger values, just maybe 40,

44:53.610 --> 45:00.690
50, 60, so we would have something which would be according to that

45:00.690 --> 45:01.030
structure.

45:07.370 --> 45:07.910
Thank you.

45:09.350 --> 45:09.630
Yes.

45:12.650 --> 45:14.490
Come on, this is...

45:15.210 --> 45:18.850
I did some... no, what did I do here?

45:20.370 --> 45:22.410
Sorry, I did something completely...

45:24.390 --> 45:26.050
I have to erase all that.

45:26.650 --> 45:29.710
Sorry, I was just misled by those dots.

45:30.850 --> 45:34.010
I wanted to fill in, just forget what I said, restart.

45:34.670 --> 45:35.090
B3.

45:35.950 --> 45:37.590
This example, M equals 3.

45:37.750 --> 45:41.570
M equals 3, you should have complained about the different points.

45:42.010 --> 45:46.090
You should have complained, because I said, for M equals 3, they have

45:46.090 --> 45:50.910
at most M successors, and if you have M successors, you have only M

45:50.910 --> 45:51.650
minus 1 keys.

45:51.830 --> 45:54.230
The maximum number of keys you have in here is 2.

45:55.430 --> 46:01.530
And so, the dots indicate the numbers that I wanted to put in here.

46:02.450 --> 46:09.050
And so, assume here we have, for example, 15 as a number, and let's

46:09.050 --> 46:09.770
say 25.

46:10.350 --> 46:14.290
And here we have maybe 5 and 10.

46:14.730 --> 46:21.030
Here we have, let's say, just 17.

46:22.110 --> 46:27.850
And here, let's say, here we need larger, two larger numbers, for

46:27.850 --> 46:30.890
example, 30 and 35.

46:31.150 --> 46:32.070
Now this makes sense.

46:32.470 --> 46:34.790
Now we have the correct structure.

46:35.430 --> 46:40.310
Here we have nodes which have two entries in there.

46:41.210 --> 46:45.330
Those that have two entries have actually three successors.

46:45.610 --> 46:49.390
Here is one node having only one entry, and we have two successors.

46:50.430 --> 46:50.590
Yeah?

46:51.290 --> 46:53.530
This definitely makes sense.

46:54.950 --> 46:56.450
And now, if we would

47:00.490 --> 47:06.950
insert something, like just see, yeah, so we would like to modify that

47:06.950 --> 47:07.470
structure.

47:07.890 --> 47:13.050
How can we actually manage such a

47:13.050 --> 47:13.310
structure?

47:13.450 --> 47:21.690
For example, you would like to input a new value.

47:21.910 --> 47:25.390
You would like to input the key 20.

47:26.010 --> 47:27.870
You would like to put 20 in there.

47:27.970 --> 47:29.170
You would like to insert something.

47:30.570 --> 47:34.510
Like if you just search for a value, it's no problem.

47:34.970 --> 47:39.170
We just find it in those nodes.

47:39.510 --> 47:44.090
If you would like to insert something, we would start at the root.

47:44.910 --> 47:48.430
Look, 20 is larger than 15, smaller than 25.

47:48.650 --> 47:50.230
We go into this node here.

47:50.650 --> 47:53.410
We notice there's a node having a 17 in there.

47:53.730 --> 47:59.590
It just has one key, and so we just enter the 20 in there as an

47:59.590 --> 48:03.030
additional node, or additional key.

48:03.610 --> 48:05.690
So in this case, I would enter here another.

48:05.910 --> 48:10.350
So we had 17 in there, and we would also include 20.

48:12.350 --> 48:18.510
And now I certainly would also have to extend this here by an

48:18.510 --> 48:19.790
additional leaf.

48:20.790 --> 48:25.550
Okay, this is simple, because this node here was small, and we just

48:25.550 --> 48:27.330
added one key there.

48:27.590 --> 48:29.870
Nothing else has to be changed.

48:30.470 --> 48:37.910
Now, let's say we would like to enter a key, which is the key 18.

48:39.730 --> 48:44.490
If we do that, we would have to look at 18.

48:44.810 --> 48:46.110
We go into this node.

48:46.650 --> 48:50.090
We notice 18 is larger than 15, is smaller than 25.

48:50.450 --> 48:51.610
We go into another node.

48:52.050 --> 48:54.790
We see 17 is smaller than 18.

48:55.090 --> 48:58.570
20 is larger, so we would have to insert something there.

48:59.290 --> 49:04.410
But we already have two nodes, two keys in that node, and so we'd have

49:04.410 --> 49:05.150
to do something.

49:05.830 --> 49:09.690
So in this case, we would have to split up that node.

49:09.850 --> 49:11.830
So let me just redraw this.

49:12.470 --> 49:16.390
We would have to, we have here this larger one, this larger node

49:16.390 --> 49:19.230
having three leaves there.

49:19.230 --> 49:24.270
We have this larger one having three leaves there.

49:24.810 --> 49:27.990
And here we had the 17 and the 20.

49:29.170 --> 49:31.510
Now the thing that we do there is the following.

49:31.890 --> 49:36.510
We cannot insert the 18 there, because then we have a node having

49:36.510 --> 49:38.010
three keys inside.

49:39.230 --> 49:44.710
So we have to split that node into two new nodes, namely a node having

49:44.710 --> 49:51.930
only 17 in there, a node having 20 in there, having two leaves each,

49:52.290 --> 49:55.550
and so here these simple leaves in the end.

49:55.650 --> 49:57.230
I don't draw all these leaves.

49:58.490 --> 50:05.130
And then we have to move the 18, like the 18 which belongs somewhere

50:05.130 --> 50:07.830
here in the middle, this has to be moved up.

50:08.450 --> 50:11.770
To be moved up means it has to be inserted in that node.

50:12.510 --> 50:20.950
So we would get on the top here, where, well, before we only had here

50:20.950 --> 50:28.510
one, we had 15, and we had 25.

50:29.310 --> 50:37.570
Now we would move 18 in here, and we would have now these, well, four

50:37.570 --> 50:37.970
descendants.

50:38.250 --> 50:38.730
Doesn't work.

50:38.830 --> 50:39.270
Not allowed.

50:40.450 --> 50:43.130
We must not include three keys in there.

50:43.650 --> 50:46.130
So we have to do something else.

50:46.570 --> 50:50.830
We have the same situation that we had down here, where we had to

50:50.830 --> 50:51.590
split the node.

50:51.850 --> 50:57.810
So now we have to split that node, that root, and do something new.

50:58.050 --> 50:59.150
So what do we do?

50:59.770 --> 51:00.790
We have to split it.

51:01.550 --> 51:06.690
And to split it means we would have to take the 15 as one new node, we

51:06.690 --> 51:17.210
would have to take the 25 as one new node, having two descendants, and

51:17.210 --> 51:21.490
a new node containing the 18 as a new root node.

51:22.150 --> 51:27.970
And this splitting of a root node, when it becomes too large, always

51:27.970 --> 51:32.090
leads to a new node, a new root, having only two descendants.

51:32.830 --> 51:39.290
That's why we have here this rule that the root has at least two

51:39.290 --> 51:46.390
successors, or the others will always have at least m over 2

51:46.390 --> 51:47.270
successors.

51:47.910 --> 51:51.970
There you will never have the situation that you only move one, or

51:51.970 --> 51:56.490
that you only have one key in a node, but you will have at least m

51:56.490 --> 51:59.590
over 2 keys in a node.

52:00.470 --> 52:06.270
And this way you have this structure.
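[Editor's note: the split step from the walkthrough can be sketched as follows; this shows only the split of an overfull node for m = 3, not a complete insertion routine.]

```python
def split(keys):
    # An overfull node has three keys when m = 3: the smallest stays in a
    # left node, the largest goes to a right node, and the middle key
    # moves up into the parent (or becomes a new root).
    assert len(keys) == 3, "split only applies to an overfull node"
    left, middle, right = sorted(keys)
    return [left], middle, [right]

print(split([17, 18, 20]))  # the leaf split from the example
print(split([15, 18, 25]))  # the root split: 18 moves up as the new root
```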

52:06.670 --> 52:11.670
Now here certainly we have several nodes having two descendants, two

52:11.670 --> 52:19.890
successors, because m over 2, that is 3 over 2, is 2 if we take the next integer

52:19.890 --> 52:20.430
for that.

52:20.830 --> 52:25.570
So here we have those structures, but I just wanted to indicate how we

52:25.570 --> 52:30.870
actually produce a new level in that structure.

52:31.450 --> 52:35.730
And as you see immediately, since we only add something at the top, we

52:35.730 --> 52:36.710
add a new root.

52:37.630 --> 52:42.790
The distance of the leaves from the root, for all the leaves, is

52:42.790 --> 52:43.710
always the same.

52:44.890 --> 52:48.750
Here it is increased by one for all the leaves, because we have

52:48.750 --> 52:50.670
introduced one more level.

52:51.050 --> 52:57.950
We would never introduce just a new inner node at the lower level, so

52:57.950 --> 53:03.010
we would not introduce something on that level to increase the path

53:03.010 --> 53:05.030
length in the tree.

53:05.390 --> 53:09.850
The only thing we would modify having an effect on the path length is

53:09.850 --> 53:14.370
by introducing a new root, by splitting up the current root if it

53:14.370 --> 53:20.450
becomes too populated, or if we would delete keys from that structure,

53:20.450 --> 53:27.030
we would maybe delete a root and merge the two descendants into one

53:27.030 --> 53:35.810
new node, and in that way reduce the path length in that structure.

53:36.310 --> 53:40.290
And that means we always have the same path length.

53:40.950 --> 53:45.050
Now the number of nodes or number of keys in the nodes can be

53:45.050 --> 53:51.090
different, so what we actually have to do is, when we are going into

53:51.090 --> 53:55.930
such a node and look for the appropriate next link, we have to find

53:55.930 --> 53:57.010
the appropriate position.

53:57.570 --> 54:01.530
For that you would have some ordered list of the keys in that node.

54:02.130 --> 54:07.110
To find the correct position is some kind of binary search, or you

54:07.110 --> 54:10.390
could have an internal search structure in there, but it can always be

54:10.390 --> 54:17.210
done in logarithm of the number of elements in that node.

54:17.930 --> 54:24.950
And so, since the path length is log to the base of m of

54:24.950 --> 54:30.610
n, and inside each node you have logarithm of m, that is the

54:30.610 --> 54:32.910
maximum number of operations.

54:33.730 --> 54:43.150
And so the overall time is actually logarithm of n for the time for

54:43.150 --> 54:46.290
searching, inserting, or deleting.
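[Editor's note: a numeric check of the bound just stated; log base m of n nodes on the path, times log base 2 of m comparisons inside each node, gives log base 2 of n overall, independent of m.]

```python
import math

n = 1_000_000
for m in (4, 64, 1024):
    # path length (log_m n) times work per node (log2 m)
    total = math.log(n, m) * math.log2(m)
    print(m, round(total, 6), round(math.log2(n), 6))
```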

54:47.390 --> 54:49.550
I don't know why I have an e there.

54:50.670 --> 54:51.570
I have to delete that.

54:52.170 --> 54:52.750
Do you have a question?

54:58.280 --> 55:00.040
What do you mean by optimal?

55:05.500 --> 55:10.500
Well, it may be that it's not optimal, not necessarily optimal,

55:10.700 --> 55:15.900
because it may be that there would be several different possibilities

55:15.900 --> 55:20.180
to actually build such a tree.

55:20.940 --> 55:24.860
You could have many large nodes, or you could have fewer...

55:26.180 --> 55:31.980
no, you could have few large nodes, or it could also happen that you

55:31.980 --> 55:38.820
have just the same number of keys in the document, but spread over

55:38.820 --> 55:41.600
more levels of a larger tree.

55:42.000 --> 55:42.680
It may happen.

55:43.520 --> 55:50.480
You certainly could always try to do it in a way that you always

55:50.480 --> 55:52.360
maximize the size of the nodes.

55:52.660 --> 55:55.600
But the question is when it is actually optimal.

55:56.060 --> 56:01.980
If optimality refers to number of disk accesses, you should have

56:01.980 --> 56:09.680
always the maximum number of elements in each node to always have the

56:09.680 --> 56:11.500
smallest path length.

56:12.400 --> 56:14.960
But it's not completely like that.

56:15.040 --> 56:19.040
You have some variation because you have between m over 2 and m

56:19.040 --> 56:23.880
elements in every node, but it's not that much of a factor.

56:24.760 --> 56:36.080
Yeah, it can extend the number of accesses slightly, but depending on

56:36.080 --> 56:42.000
your sequence of operations, insertions and deletions, it may be that

56:42.000 --> 56:49.380
you get some thinning out in one part of the tree. Like, for example, if you

56:56.180 --> 57:01.240
have a very large tree,

57:01.240 --> 57:07.340
and you have built it up in a

57:01.240 --> 57:07.340
very concise way, or in an optimal way such that the path length

57:07.340 --> 57:14.340
really is small in all of them, you have very large, or many keys in

57:14.340 --> 57:18.240
all the nodes, and this is the same situation on both sides, and

57:18.240 --> 57:23.020
afterwards you have some deletions, for example, which only occur in

57:23.020 --> 57:27.120
the left part and not in the right part, then you might have a

57:27.120 --> 57:32.040
situation where you have only quite a few nodes which have few

57:32.040 --> 57:42.300
elements, still at least M over 2, but still fewer keys than the ones

57:42.300 --> 57:48.400
in the right-hand side, and if you would reshuffle the complete

57:48.400 --> 57:53.260
structure, you could balance that out again and you could reduce maybe

57:53.260 --> 58:00.320
the size of the tree, or the length of the path slightly, but not

58:00.320 --> 58:07.300
essentially, because it would just be a factor of 2 at most, and so

58:07.300 --> 58:14.580
you could achieve a reduction of 1 in the path length, so it's not

58:14.580 --> 58:22.500
that far away from what you have here in the balanced situation.

58:23.440 --> 58:25.480
Yeah, I hope it was correct.

58:28.100 --> 58:29.760
No, not halve the tree size.

58:30.380 --> 58:33.080
You cannot halve the tree size, because you would...

58:40.860 --> 58:41.300
Yes.

58:44.200 --> 58:48.640
You have M over 2 elements, so it can be

58:48.640 --> 58:53.960
that you have that in every node here, and then,

58:55.720 --> 59:01.020
you could reduce that.

59:10.430 --> 59:17.930
But so if you have a complete tree, if you have a tree where every

59:17.930 --> 59:23.370
node has M elements, let's just say they have M minus 1, doesn't

59:23.370 --> 59:28.130
matter, M elements; a densely

59:28.130 --> 59:31.050
populated tree would have M elements in every node.

59:32.550 --> 59:37.190
A tree, like so, if you have a tree, I hope I'm arguing correctly,

59:37.790 --> 59:47.250
here is one tree, and here every node, every node here has the maximum

59:47.250 --> 59:49.530
size, maximum number of elements.

59:50.130 --> 59:54.610
Now you have a different, now you say, what about a situation where I

59:54.610 --> 01:00:04.290
have a tree where those two subtrees are not very dense,

01:00:04.390 --> 01:00:06.250
only M over 2 elements in every node.

01:00:07.270 --> 01:00:13.330
Then you have half the number of elements, you just reduced it by a

01:00:13.330 --> 01:00:20.190
factor of 2, you put those into two subtrees.

01:00:23.230 --> 01:00:29.550
Now you could try to merge or to combine them such that you have,

01:00:29.770 --> 01:00:36.890
again, M in every node, but then you would just

01:00:36.890 --> 01:00:41.470
reduce the level by 1.

01:00:51.800 --> 01:00:59.320
No, you can halve the number of nodes; you can reduce

01:00:59.320 --> 01:01:02.120
the number of nodes by 2, but not the path length.

01:01:02.820 --> 01:01:04.600
The path length is just reduced by 1.

01:01:04.980 --> 01:01:05.200
Yes?

01:01:06.160 --> 01:01:06.500
Okay.

01:01:07.480 --> 01:01:08.000
Good.

01:01:08.720 --> 01:01:14.700
So that means here, like with respect to the path length, we don't

01:01:14.700 --> 01:01:18.580
waste that much, even if we don't have the optimum path length, it's

01:01:18.580 --> 01:01:21.340
just a deviation of at most 1, that's okay.

01:01:22.160 --> 01:01:22.240
Yeah?

01:01:23.600 --> 01:01:28.980
And so this is quite an efficient structure with respect to the number

01:01:28.980 --> 01:01:34.060
of accesses to disk, because if you like to look for some value, you

01:01:34.060 --> 01:01:39.460
would just try to get as much information from the disk as possible,

01:01:39.980 --> 01:01:42.820
and all the information you get should be relevant for your search,

01:01:43.000 --> 01:01:44.960
and it is in this case.

01:01:45.720 --> 01:01:45.900
Okay.

01:01:46.320 --> 01:01:48.500
So this is the B-tree.

01:01:48.720 --> 01:01:50.940
I didn't want to spend that much time on that, but if you have a

01:01:50.940 --> 01:01:51.660
question, it's okay.

01:01:52.100 --> 01:01:58.020
That is, by the way, one effect of recording lectures and reducing the

01:01:58.020 --> 01:02:01.100
number of participants, those that remain in class are really

01:02:01.100 --> 01:02:04.320
interested, and so I would like to answer all the questions that you

01:02:04.320 --> 01:02:07.600
have, and I pay more attention to those who actually are sitting here,

01:02:08.120 --> 01:02:14.400
but that leads to the effect that I cover less material in the course,

01:02:14.480 --> 01:02:15.600
which affects everybody.

01:02:16.340 --> 01:02:17.980
So this is some kind of a problem.

01:02:19.080 --> 01:02:19.240
Okay.

01:02:20.420 --> 01:02:24.840
Now I would like to tell you something about this other structure, the

01:02:24.840 --> 01:02:30.000
tries, which are related to information retrieval, where you

01:02:30.000 --> 01:02:36.860
specifically look at the retrieval of information that is composed of

01:02:36.860 --> 01:02:39.440
words, and you do here character-based comparisons.

01:02:40.560 --> 01:02:46.060
So here you are searching not for numbers, you are searching for

01:02:46.060 --> 01:02:54.420
character sequences, and here, for example, is a set of words that

01:02:54.420 --> 01:02:58.660
have to be looked at in this case, and the structure that you could

01:02:58.660 --> 01:03:03.440
use here... I don't know what's happening.

01:03:03.660 --> 01:03:05.380
It's doing something in the background.

01:03:06.000 --> 01:03:07.720
I hope everything is okay.

01:03:07.960 --> 01:03:12.700
I'm making an experiment today: I'm recording with a video

01:03:12.700 --> 01:03:20.080
camera and also with the normal frame grabber, and maybe it's too time

01:03:20.080 --> 01:03:26.720
-consuming, because somehow it does not... you see, all of a sudden

01:03:26.720 --> 01:03:27.780
it's all there.

01:03:28.080 --> 01:03:31.120
I didn't want to show you everything at the same time.

01:03:31.420 --> 01:03:33.260
So what did I want to show you?

01:03:34.260 --> 01:03:39.100
We have a set of words that we would like to incorporate into a tree

01:03:39.100 --> 01:03:39.500
structure.

01:03:40.020 --> 01:03:41.240
Now we can do that.

01:03:42.660 --> 01:03:45.520
Essentially, it is again the same as what I said with respect to

01:03:45.520 --> 01:03:46.120
pattern matching.

01:03:46.120 --> 01:03:51.600
We look for several words at the same time, and we have some document

01:03:51.600 --> 01:03:52.700
where those words occur.

01:03:53.300 --> 01:03:57.320
Now, how can we store such a set of words?

01:03:57.900 --> 01:04:03.860
We just look at the words in their normal sequences.

01:04:04.040 --> 01:04:07.500
Now, all of a sudden, I don't have that pen anymore.

01:04:08.160 --> 01:04:08.940
This is interesting.

01:04:09.420 --> 01:04:10.700
Now it's back in black.

01:04:11.580 --> 01:04:17.480
So the first symbol of the words, S or W, already leads to a decision

01:04:17.480 --> 01:04:19.420
on which word could be meant.

01:04:19.620 --> 01:04:23.520
So if you have a word starting with an S, and it's a word out of that

01:04:23.520 --> 01:04:29.360
set here, then it can only be that one word... so if it's starting with an

01:04:29.360 --> 01:04:31.220
S, it can only be that one word.

01:04:32.020 --> 01:04:36.900
And if we have the other words here, starting with a W, you would have

01:04:36.900 --> 01:04:38.440
to look at the second symbol.

01:04:38.920 --> 01:04:44.600
The second symbol is an E, an I, or an O, and so you can go on, and

01:04:44.600 --> 01:04:49.700
after the second symbol, you can even still have some uncertainty, so

01:04:49.700 --> 01:04:54.400
it may continue with an I or an R, and then you have all the words of

01:04:54.400 --> 01:04:57.920
that set put into one trie structure.

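The decision process described here — first symbol, then second symbol, and so on — is exactly a trie lookup. A minimal sketch in Python; the word set is hypothetical, since the slide's actual words are not fully legible in the recording:

```python
def trie_insert(root, word):
    # Walk symbol by symbol, creating child nodes as needed.
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True  # end-of-word marker

def trie_contains(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

root = {}
for w in ["we", "wind", "wire", "word"]:  # hypothetical word set
    trie_insert(root, w)

print(trie_contains(root, "wind"))  # True
print(trie_contains(root, "win"))   # False: "win" is only a prefix
```

Each level of the nested dictionaries corresponds to one symbol position, so the lookup makes exactly one branching decision per character, as in the lecture's example.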
01:04:58.750 --> 01:05:05.920
And now the idea of actually storing the information that is contained

01:05:05.920 --> 01:05:19.820
in a document as a tree is to store all the semi-infinite strings of a

01:05:19.820 --> 01:05:20.600
character sequence.

01:05:21.020 --> 01:05:23.800
What are these semi-infinite strings, or sistrings?

01:05:24.720 --> 01:05:28.640
These are suffixes of the word.

01:05:28.860 --> 01:05:29.920
So you would start...

01:05:29.920 --> 01:05:32.340
the first suffix would be everything.

01:05:32.800 --> 01:05:35.360
The next suffix would be that one.

01:05:35.700 --> 01:05:37.460
The next would be that one.

01:05:37.580 --> 01:05:39.260
The next would be that one.

01:05:39.900 --> 01:05:44.360
So you have... they certainly would not all have the same length,

01:05:44.480 --> 01:05:46.080
because they are all different suffixes.

01:05:46.700 --> 01:05:50.760
So you just look at any suffix of the word, and you would like to

01:05:50.760 --> 01:05:57.220
store all the suffixes of a sequence of characters in some tree

01:05:57.220 --> 01:06:02.620
structure, which leads to what is called a suffix tree.

01:06:03.940 --> 01:06:08.960
And in that way, you get for this example of just a sequence of zeros

01:06:08.960 --> 01:06:12.500
and ones, you would get to a structure looking like this.

01:06:13.020 --> 01:06:21.080
You have, like, the first suffix starts with a zero, and then the next

01:06:21.080 --> 01:06:22.180
one starts with a one.

01:06:22.760 --> 01:06:29.480
So if you would look for some suffix starting with

01:06:29.480 --> 01:06:35.540
a one, that means you go to a specific position and look for the

01:06:35.540 --> 01:06:41.460
suffix starting at that position, then you would see, okay, that

01:06:41.460 --> 01:06:46.180
starts with a one, and now you look for the next symbol, and after

01:06:46.180 --> 01:06:51.100
that one, there can be either a zero or a one.

01:06:51.160 --> 01:06:52.680
In that case, it would be a zero.

01:06:53.200 --> 01:06:57.360
If you would have been there, you have a suffix having a one

01:06:57.660 --> 01:06:58.520
as its second symbol.

01:06:58.920 --> 01:07:01.900
So that would be the case, and it's the only situation where that occurs if I

01:07:01.900 --> 01:07:03.820
just look for the first eight suffixes.

01:07:04.700 --> 01:07:08.940
So then, if you'd have a sequence of one, one, you know immediately

01:07:08.940 --> 01:07:11.740
that you are at position two of that document.

01:07:13.340 --> 01:07:16.220
This sequence one, one occurs at position two.

01:07:16.820 --> 01:07:21.480
The sequence one, zero, zero, zero occurs at position six.

01:07:23.220 --> 01:07:30.420
Yeah, one, there it is, one, zero, zero, one, zero, zero, zero.

01:07:30.700 --> 01:07:34.480
Yes, exactly, three zeros at position six, and so on.

01:07:35.120 --> 01:07:40.220
And so, here you have a way of storing the information that is

01:07:40.220 --> 01:07:44.980
contained in your sequence of characters in a so-called suffix tree.

01:07:46.060 --> 01:07:53.880
And when you look for a certain word or certain suffix, you would just

01:07:53.880 --> 01:08:00.660
go into that suffix tree and go down until you actually get to some

01:08:00.660 --> 01:08:05.160
leaf where you have a decision that is the position where that suffix

01:08:05.160 --> 01:08:05.840
occurs.

01:08:07.200 --> 01:08:09.340
This also tells you the following.

01:08:10.260 --> 01:08:17.520
If you have a suffix starting with zero, one, you have two positions

01:08:17.520 --> 01:08:22.060
where that suffix occurs, positions five and one.

01:08:24.720 --> 01:08:33.140
So, if you look for suffixes starting with a one, they occur at just

01:08:33.140 --> 01:08:34.240
three positions.

01:08:36.060 --> 01:08:40.020
Yeah, there are three positions out of those eight that I have stored

01:08:40.020 --> 01:08:42.360
here that start with a one.

01:08:43.060 --> 01:08:46.780
It's that position, that position, and that position.

01:08:48.360 --> 01:08:52.260
The other ones here start at positions higher than

01:08:52.260 --> 01:08:52.780
eight.

01:08:53.400 --> 01:08:56.160
Therefore, they are not recorded in that structure.

01:08:57.220 --> 01:09:00.440
But in this way, you have interesting information.

01:09:01.200 --> 01:09:07.100
If you just look at some partial path from the root to the leaf, and

01:09:07.100 --> 01:09:13.740
you look, for example, at this path here, to that node, and you

01:09:13.740 --> 01:09:19.320
are looking for a certain word inside a document,

01:09:19.940 --> 01:09:24.920
then, if you look at the number of leaves that are

01:09:24.920 --> 01:09:30.420
below that node, you immediately have the number of occurrences of the word that you

01:09:30.420 --> 01:09:32.960
have looked at, the word zero one, in your document.

01:09:34.000 --> 01:09:39.500
So this is a simple way of answering the question: how often does a certain

01:09:39.500 --> 01:09:41.820
word occur in a document?

01:09:42.840 --> 01:09:46.000
It's stored in the suffix tree in a very simple way, and certainly you

01:09:46.000 --> 01:09:52.320
could always record the number of leaves that are

01:09:52.320 --> 01:09:55.020
in the subtree below in each node.

01:09:55.720 --> 01:10:00.560
It's just one additional value which can be stored there while you are

01:10:00.560 --> 01:10:02.480
constructing the tree, that's no problem.

01:10:04.980 --> 01:10:09.120
Okay, these different semi-infinite strings are identified by their

01:10:09.120 --> 01:10:11.900
starting position, that's obvious, and then you have here this

01:10:11.900 --> 01:10:12.920
information.

01:10:13.280 --> 01:10:16.740
From the starting position, you have the locations of these different

01:10:16.740 --> 01:10:17.100
words.

01:10:17.240 --> 01:10:18.500
So it's an efficient structure.

01:10:19.160 --> 01:10:23.000
It can even be made more efficient because of the following.

01:10:23.600 --> 01:10:29.760
What we have here, for example, there, we have a sequence of nodes

01:10:29.760 --> 01:10:33.820
where we don't actually have a decision to make.

01:10:35.140 --> 01:10:39.140
Here, we see that after a zero, we always have a zero.

01:10:40.260 --> 01:10:43.480
So it doesn't make sense, actually, to look for those two symbols.

01:10:43.640 --> 01:10:48.820
It would be more reasonable to just go there immediately and look at

01:10:48.820 --> 01:10:57.080
the first position, and then, if there is a one, you decide

01:10:57.080 --> 01:11:00.460
whether the next symbol is a zero or a one,

01:11:01.300 --> 01:11:07.380
and then you just look whether the symbol after the next is a zero or

01:11:07.380 --> 01:11:11.820
a one, because you know if you have a zero, the symbol that must come next is a zero.

01:11:12.160 --> 01:11:16.420
Other symbols, other sequences are not in that document.

01:11:16.980 --> 01:11:22.060
And so you could, in your structure, you could account for that by

01:11:22.060 --> 01:11:24.480
reducing or by deleting that node.

01:11:25.060 --> 01:11:26.720
The same would occur here.

01:11:26.980 --> 01:11:32.200
You would delete that node because after zero, zero, one, you always

01:11:32.200 --> 01:11:36.620
have a zero in those suffixes that are stored there, and so you don't

01:11:36.620 --> 01:11:37.640
need that extra node.

01:11:38.840 --> 01:11:44.020
And this is just what I indicated here with animations, so

01:11:44.020 --> 01:11:49.800
if I cut those out again, here you have those

01:11:49.800 --> 01:11:54.080
shortcuts which could be put into the structure.

01:11:55.780 --> 01:12:00.200
And so you just delete redundant comparisons, you delete non-branching

01:12:00.200 --> 01:12:04.760
paths, reduce them to shorter paths, and so you have a structure like

01:12:04.760 --> 01:12:11.820
this, but now you have to indicate that in the nodes in some way.

01:12:12.480 --> 01:12:15.580
There are two possible ways of indicating what we have done here.

01:12:16.080 --> 01:12:24.720
One way is to always have a pointer in the node on the position of the

01:12:24.720 --> 01:12:28.960
word that you're actually comparing to find whether it occurs in the

01:12:28.960 --> 01:12:29.320
document.

01:12:30.480 --> 01:12:34.720
And so here you would look at the first symbol, there at the second

01:12:34.720 --> 01:12:39.200
symbol, and here you only look at the fourth symbol, you ignore the

01:12:39.200 --> 01:12:39.800
third symbol.

01:12:41.080 --> 01:12:45.240
That would be one way of storing that, another way will be indicated

01:12:45.240 --> 01:12:46.220
on the next slide.

01:12:47.000 --> 01:12:51.540
So here, certainly, you need additional information; you can delete

01:12:51.540 --> 01:12:59.000
certain things, nodes where you don't actually have a branch, and so

01:12:59.000 --> 01:13:02.540
that would lead to a more efficient structure, but you certainly have

01:13:02.540 --> 01:13:07.120
to indicate now what is the position in your pattern that you have to

01:13:07.120 --> 01:13:12.380
look at in order to find out whether it occurs in the document.

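The deletion of non-branching nodes can be sketched as plain path compression of a trie: chains of single-child nodes are collapsed into one edge whose label is the concatenated string (the variant where complete strings label the edges). This is only a sketch, not the lecture's exact construction, and the stored strings are hypothetical:

```python
def insert(node, word):
    # Plain trie insertion, one nested dict level per symbol.
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True  # end-of-string marker

def compress(node):
    # Collapse chains of nodes that have exactly one child and are not
    # the end of a stored string; the edge label grows accordingly.
    out = {}
    for label, child in node.items():
        if label == "$":
            out["$"] = True
            continue
        while len(child) == 1 and "$" not in child:
            (nxt, grand), = child.items()
            label += nxt        # absorb the non-branching step
            child = grand
        out[label] = compress(child)
    return out

trie = {}
for s in ["0000", "0001", "0010"]:  # hypothetical stored suffixes
    insert(trie, s)
print(compress(trie))  # shared prefix "00" becomes a single edge
```

After compression, every remaining internal node represents a real decision point, which is exactly the property of the non-redundant (Patricia) tree described here.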
01:13:13.260 --> 01:13:17.980
And these non-redundant trees are also called Patricia trees, and

01:13:17.980 --> 01:13:22.540
Patricia certainly has been the name of a girlfriend or a daughter or

01:13:22.540 --> 01:13:27.080
whatever of the person who invented that, but certainly he has hidden

01:13:27.080 --> 01:13:31.240
that by giving just the perfect acronym for it.

01:13:31.660 --> 01:13:35.880
It stands for Practical Algorithm To Retrieve Information Coded In

01:13:35.880 --> 01:13:36.340
Alphanumeric.

01:13:37.480 --> 01:13:38.200
That's Patricia.

01:13:38.540 --> 01:13:42.300
Everybody would immediately see, when he sees the word Patricia, that it's

01:13:42.300 --> 01:13:43.680
an abbreviation for that phrase.

01:13:44.440 --> 01:13:51.400
So it is something which had been actually designed for these specific

01:13:51.400 --> 01:13:57.500
purposes of finding patterns in a word, and it actually has been

01:13:57.500 --> 01:14:03.640
designed by a person called Gonnet at the University of Waterloo, who

01:14:03.640 --> 01:14:05.160
actually just gave a talk last week.

01:14:07.040 --> 01:14:12.400
And these Patricia trees have been designed in a larger project where

01:14:12.400 --> 01:14:19.240
they had to deal with retrieving words from longer texts such that you

01:14:19.240 --> 01:14:24.820
can later on easily make queries on that large document.

01:14:25.760 --> 01:14:33.460
So a PAT tree now is a Patricia tree containing all the suffixes of

01:14:33.460 --> 01:14:38.020
the text, and in particular what we do is we don't start at every

01:14:38.020 --> 01:14:43.440
position in the document, but we only start at reasonable positions at

01:14:43.440 --> 01:14:44.840
the beginning of a word.

01:14:45.480 --> 01:14:49.540
So now we come back to our question, how can we separate a document

01:14:49.540 --> 01:14:50.520
into words?

01:14:51.280 --> 01:14:56.220
We know how we do that, and now we have the

01:14:56.220 --> 01:15:03.480
suffix trees, and we just build a suffix tree looking at

01:15:04.280 --> 01:15:08.840
the start, every reasonable position. Reasonable positions of

01:15:10.360 --> 01:15:15.940
suffixes are just those positions here, for example, in this simple or

01:15:15.940 --> 01:15:17.640
a bit strange phrase.

01:15:18.360 --> 01:15:22.380
If you would like to look for the occurrences of words in that

01:15:22.380 --> 01:15:26.900
document, a very short string, how would that be stored?

01:15:27.440 --> 01:15:33.300
That would be some possibility to store this information on the

01:15:33.300 --> 01:15:40.040
suffixes in that document, but only those suffixes that start at the

01:15:40.040 --> 01:15:41.620
starting positions of words.

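Restricting the suffixes to word beginnings can be sketched like this; the phrase used here is hypothetical, since the slide's sentence is not fully legible in the recording:

```python
def word_start_suffixes(text):
    # 1-based starting positions of words, and the suffix beginning at
    # each of them; a PAT tree stores exactly these suffixes instead of
    # a suffix for every character position.
    starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == " "]
    return [(p + 1, text[p:]) for p in starts]

doc = "peter piper picked"  # hypothetical phrase
for pos, suffix in word_start_suffixes(doc):
    print(pos, suffix)
# 1 peter piper picked
# 7 piper picked
# 13 picked
```

Only these few suffixes go into the tree, which is what keeps the PAT tree small compared to a full suffix tree over every position.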
01:15:43.240 --> 01:15:48.860
Then if you look at all the words here, they start either with a W or

01:15:48.860 --> 01:15:54.640
with a P, so just by looking at the first symbol, we only need P and

01:15:54.640 --> 01:16:01.720
W, or if you have a very long text, you will have more branches here,

01:16:02.080 --> 01:16:06.520
because you would get a successor for every first symbol of a word,

01:16:07.320 --> 01:16:11.180
there may be many different symbols, but here we only have two

01:16:11.180 --> 01:16:14.200
different symbols, otherwise it wouldn't have fit really on this

01:16:14.200 --> 01:16:14.540
slide.

01:16:15.960 --> 01:16:23.660
And then if we have a P, we see there are two occurrences of something

01:16:23.660 --> 01:16:28.700
starting with a P, it is Peter followed by a blank, or Peter followed

01:16:28.700 --> 01:16:29.640
by an S.

01:16:30.640 --> 01:16:34.360
And this difference only occurs at position six.

01:16:35.940 --> 01:16:39.520
And so this is the reduced version, we just look at position six

01:16:39.520 --> 01:16:46.160
there, to see whether the next symbol is a blank or an S, and if it is a blank, we

01:16:46.160 --> 01:16:50.100
actually start at position 13, and if it is an S, we are at position

01:16:50.100 --> 01:16:50.560
19.

01:16:52.320 --> 01:16:55.980
And in the same way, we have to analyze those words that are the

01:16:55.980 --> 01:17:02.460
suffixes that start with a W, and you see that here we

01:17:02.460 --> 01:17:05.280
actually cannot reduce anything, we have to look at successive

01:17:05.280 --> 01:17:09.800
positions, but you only have to look at the first three symbols of

01:17:09.800 --> 01:17:14.720
these suffixes, and then we already know that these are the suffixes

01:17:14.720 --> 01:17:18.320
starting at 5, 1, 10, or 28.

01:17:19.620 --> 01:17:27.200
So this is a way of producing such a PAT tree, where we only store

01:17:27.200 --> 01:17:34.640
those suffixes starting at word beginnings, and now it is a non-redundant tree, and there's

01:17:34.640 --> 01:17:36.760
one thing which is actually left out.

01:17:37.780 --> 01:17:45.960
Assume you take this tree structure here, and you are looking for the

01:17:45.960 --> 01:17:47.700
word Petra.

01:17:48.860 --> 01:17:53.140
If you're looking for the word Petra, you would see, okay, the first

01:17:53.140 --> 01:17:58.440
symbol is a P, and after that, at position 6, there's a blank, so you would

01:17:58.440 --> 01:18:00.200
say it occurs at position 13.

01:18:00.740 --> 01:18:03.540
But at position 13, you have not Petra, but Peter.

01:18:04.680 --> 01:18:06.520
So something has been lost.

01:18:07.680 --> 01:18:12.740
And this is taken care of in the other way of storing

01:18:12.740 --> 01:18:19.720
this, you don't indicate in the nodes the current position that you

01:18:19.720 --> 01:18:27.700
have to check, but you indicate as a label of the edge the word that

01:18:27.700 --> 01:18:33.100
is actually leading to the next decision position to distinguish

01:18:33.100 --> 01:18:38.540
between different suffixes, but you record there the complete string

01:18:38.540 --> 01:18:42.440
that is in between those two positions that had to be compared.

01:18:42.680 --> 01:18:46.980
So if you store here the word Peter, you would immediately notice, if

01:18:46.980 --> 01:18:49.500
you look for Petra, that there is a difference.

01:18:50.500 --> 01:18:57.080
So in this way, you would have still a reduction of the tree size, but

01:18:57.080 --> 01:19:02.680
certainly, now you have to store for every link the string

01:19:02.680 --> 01:19:09.720
that contains the symbols where you didn't actually have differences

01:19:09.720 --> 01:19:16.100
in the suffixes, but if you look later on for the occurrence of words

01:19:16.100 --> 01:19:19.280
in a document, you need that information on the intermediate

01:19:19.280 --> 01:19:23.840
positions, whether they actually occur in the pattern that you look

01:19:23.840 --> 01:19:24.080
for.

01:19:24.760 --> 01:19:30.000
So if you have this structure, now you can easily search for that.

01:19:30.400 --> 01:19:34.960
You could look for all the occurrences of Peter and you see it's two

01:19:34.960 --> 01:19:35.420
positions.

01:19:35.780 --> 01:19:40.720
If you look for all occurrences of suffixes starting with a W, you

01:19:40.720 --> 01:19:42.740
would say, okay, there are four such positions.

01:19:42.800 --> 01:19:49.460
If you look for all the occurrences of the word we, it's twice there,

01:19:49.640 --> 01:19:50.180
and so on.

01:19:50.420 --> 01:19:55.980
But certainly, we, in this case, would not be there with we followed

01:19:55.980 --> 01:20:01.400
by a blank, not as a separate word, but just as a prefix of via and

01:20:01.400 --> 01:20:01.660
vice.

01:20:01.780 --> 01:20:06.240
So if you would like to look for the word we, you would have to look

01:20:06.240 --> 01:20:09.900
for we and a blank afterwards, which does not occur here.

01:20:11.520 --> 01:20:17.660
Okay, so this just shows you how we can produce these search

01:20:17.660 --> 01:20:22.080
structures, and then the question is how we can actually utilize that.

01:20:22.920 --> 01:20:27.940
So if you search in a PAT tree, if you have some word that you are

01:20:27.940 --> 01:20:35.160
looking for, then what you actually do is to determine

01:20:35.160 --> 01:20:38.160
the starting point for pattern matching.

01:20:39.820 --> 01:20:44.440
If the suffix starts with the pattern that you are

01:20:44.440 --> 01:20:49.140
looking for, then you can look, does it really occur at that position?

01:20:49.280 --> 01:20:54.620
And if not, the first mismatch in the sequence of

01:20:54.620 --> 01:20:59.840
comparisons between the letters in the suffix and your pattern would

01:20:59.840 --> 01:21:01.100
indicate that it's not there.

01:21:02.600 --> 01:21:07.680
Then you compare the search term and the

01:21:07.680 --> 01:21:08.160
suffix.

01:21:12.420 --> 01:21:19.600
So if only individual letters are used as labels, then you

01:21:19.600 --> 01:21:22.080
actually have to compare the search term and the suffix in the document,

01:21:22.200 --> 01:21:27.380
otherwise you would have that information in

01:21:27.380 --> 01:21:30.940
the structure already, and you don't have to go into the document.

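The Peter/Petra effect can be sketched as follows: the skip-based search yields exactly one candidate position, and because skipped symbols were never compared, that candidate must be verified against the document itself. The document string and words here are hypothetical:

```python
def verify_candidate(text, pos, pattern):
    # `pos` is the 1-based candidate position that the skip-based
    # Patricia search returned; the skipped symbols were never compared,
    # so the pattern must be checked against the document text now.
    return text.startswith(pattern, pos - 1)

doc = "we pick peter here"  # hypothetical document
print(verify_candidate(doc, 9, "peter"))  # True: the word really is there
print(verify_candidate(doc, 9, "petra"))  # False: mismatch at a skipped symbol
```

With full string labels on the edges, this final check can already happen inside the tree; with single-letter labels and skip positions, this one comparison against the document is unavoidable.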
01:21:32.520 --> 01:21:37.160
And if you would like to insert something, so we have just discussed

01:21:37.160 --> 01:21:39.620
those things here on the previous slide.

01:21:39.940 --> 01:21:44.640
If you would like to insert something into the PAT tree, you do that

01:21:44.640 --> 01:21:46.780
in a similar way as in a search tree.

01:21:46.780 --> 01:21:58.080
So, on the next slide, I give you an example of this strange

01:21:58.080 --> 01:21:59.520
document here.

01:22:00.840 --> 01:22:07.180
Just an abbreviated example: assume every symbol here is the start

01:22:07.180 --> 01:22:07.820
of a word.

01:22:07.900 --> 01:22:11.180
I said that in the PAT tree, you would only look at all the

01:22:11.180 --> 01:22:14.500
suffixes starting at beginnings of words.

01:22:15.080 --> 01:22:18.840
So just assume that these symbols here are the starting positions of

01:22:18.840 --> 01:22:26.420
words, and then you would have all these suffixes here, and if they

01:22:26.420 --> 01:22:30.040
are stored in a suffix tree, you get this.

01:22:30.320 --> 01:22:37.580
If it's reduced, here we have actually redundant information.

01:22:38.100 --> 01:22:43.660
We have the string in between the first and the fourth position, A, B,

01:22:43.740 --> 01:22:47.460
B, and also the indication here that we're looking at the fourth

01:22:47.460 --> 01:22:47.900
position.

01:22:48.660 --> 01:22:52.060
So you can use either one of those.

01:22:52.220 --> 01:22:58.000
Either you have the strings as labels of the edges, or you have the

01:22:58.000 --> 01:23:00.440
positions in the nodes.

01:23:02.580 --> 01:23:09.380
But this just gives you, for this example, better information on where we

01:23:09.380 --> 01:23:14.820
are currently looking, and it indicates the kind of tree you get

01:23:14.820 --> 01:23:16.780
as a result of that word.

01:23:17.300 --> 01:23:24.180
So here we have to continue and look at other questions.

01:23:24.740 --> 01:23:25.260
Frequency.

01:23:26.660 --> 01:23:29.680
I mentioned that already: you would like to see how often a

01:23:29.680 --> 01:23:30.520
certain word occurs.

01:23:31.040 --> 01:23:34.820
You just look at the number of leaves in the corresponding subtree

01:23:34.820 --> 01:23:37.740
below that position where your search ended.

01:23:39.060 --> 01:23:43.020
If you look for the most frequent word in the text, you just search

01:23:43.020 --> 01:23:47.880
all the paths from the root to the next blank, and determine the

01:23:47.880 --> 01:23:54.940
sequence there. A path from the root to the next blank means that

01:23:55.300 --> 01:24:00.740
the word on the sequence of those edges

01:24:00.740 --> 01:24:06.040
is actually one complete word, because it's separated

01:24:06.040 --> 01:24:08.240
by an initial blank and a final blank.

01:24:09.140 --> 01:24:18.140
And then you just look for the word having the largest subtree below.

01:24:19.920 --> 01:24:23.380
You look for the first occurrence of a blank in your suffix tree, and

01:24:23.380 --> 01:24:28.900
then look for the largest subtree that you have below those blanks.

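The two frequency queries — how often a word occurs, and which word is most frequent — correspond to leaf counts below the complete-word paths in the PAT tree. A naive equivalent with plain counting, over a hypothetical text:

```python
from collections import Counter

def word_frequencies(text):
    # The count per word equals the number of leaves below the path
    # root -> word -> blank in the PAT tree.
    return Counter(text.split())

doc = "to be or not to be"  # hypothetical text
freq = word_frequencies(doc)
print(freq["to"], freq["be"])  # both words occur twice
print(max(freq.values()))      # size of the largest such subtree
```

In the tree, the second query is answered by looking at the first blank on each path from the root and picking the one with the largest subtree below it; here the maximum over the counter plays that role.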
01:24:29.360 --> 01:24:36.440
Okay, it's already past the normal finishing time, so I'd like to stop

01:24:36.440 --> 01:24:37.760
this lecture.

01:24:38.120 --> 01:24:38.920
Thanks for your attention.

01:24:39.620 --> 01:24:41.620
See you again in two weeks.

01:24:41.820 --> 01:24:43.440
No, not two weeks, three weeks.

01:24:43.880 --> 01:24:49.580
Next week I have to be at a meeting in Berlin because of important

01:24:50.640 --> 01:24:56.460
projects that we arranged in one of those knowledge and innovation

01:24:56.460 --> 01:24:58.440
communities that are currently going on.

01:24:59.320 --> 01:25:01.520
And in two weeks I have to be at a...

