WEBVTT

00:00.920 --> 00:02.100
Good morning.

00:02.720 --> 00:06.380
Welcome to another session of Algorithms for Internet Applications.

00:07.380 --> 00:16.420
The first slide that I show you this morning is a slide on some, not

00:16.420 --> 00:21.060
commercial, but some event by Daimler-Chrysler.

00:21.060 --> 00:23.260
Daimler-Chrysler switched to German.

00:25.940 --> 00:33.060
In cooperation with the university, Daimler-Chrysler organized a

00:33.060 --> 00:34.180
career roadshow.

00:34.380 --> 00:35.620
I hope I can stay in English.

00:36.400 --> 00:43.860
Career roadshow, recruiting, technical talent program, everything in

00:43.860 --> 00:44.160
English.

00:46.000 --> 00:48.600
Daimler -Chrysler is looking for young engineers.

00:49.420 --> 00:52.240
Economic engineers are also young engineers.

00:53.780 --> 01:00.000
At a roadshow on the 7th of December, in one week, from 10 to 5 p.m.

01:00.160 --> 01:07.820
So, right after this lecture, you can listen to something about

01:07.820 --> 01:10.460
possibilities to work at Daimler-Chrysler.

01:11.260 --> 01:16.640
Dr. Bartke will tell you something about what is being worked on

01:16.640 --> 01:16.780
there.

01:20.080 --> 01:21.600
Okay, that's it.

01:25.940 --> 01:29.540
It's always important to inform yourself in time about what is

01:29.540 --> 01:30.320
possible on the job market.

01:32.360 --> 01:41.460
We switch to English again and look again at the topics that we

01:41.460 --> 01:44.020
addressed last week.

01:44.760 --> 01:48.120
We looked at algorithms for searching for information.

01:48.680 --> 02:00.280
I showed you something about searching for similar words, like

02:00.280 --> 02:01.360
similarity search.

02:01.500 --> 02:04.980
We looked at metrics for determining the distance between words.

02:06.100 --> 02:12.660
We looked at simple algorithms for pattern matching in texts.

02:12.660 --> 02:19.380
You have a query and look for the occurrence of patterns of some words

02:19.380 --> 02:20.040
in text.

02:20.680 --> 02:28.280
One algorithm, like the initial naive algorithm, had a worst-case

02:28.280 --> 02:31.920
complexity of order of n times m, where n is the length of the text

02:31.920 --> 02:33.680
and m is the length of the pattern.

02:34.660 --> 02:39.160
Then we looked at the ideas that Knuth, Morris, and Pratt had some

02:39.160 --> 02:43.980
time ago about pattern matching, where you exploit the knowledge that

02:43.980 --> 02:48.700
you acquired while scanning through the pattern and the text and

02:48.700 --> 02:51.900
determine an appropriate shifting distance.

02:52.220 --> 02:56.940
The shifting distance that was computed here was actually not the

02:56.940 --> 03:01.440
shifting distance, but the position of the next symbol of the pattern

03:01.440 --> 03:05.120
that should be moved to the current position in the document.

03:05.840 --> 03:09.540
So we had to pre-process our pattern in order to determine reasonable

03:09.540 --> 03:10.360
shift distances.

03:11.060 --> 03:18.780
That was the thing that was visualized in this figure here and

03:18.780 --> 03:24.940
formulated formally in this definition of this function nextj.

03:25.340 --> 03:27.600
We looked at ways to compute that.

03:27.600 --> 03:33.040
The main idea was that whenever we move the pattern by some distance,

03:33.160 --> 03:36.980
we know that some part of the pattern might overlap.

03:37.720 --> 03:41.180
Since we know everything about the pattern beforehand, we can exploit

03:41.180 --> 03:43.400
this knowledge.

03:44.080 --> 03:52.540
Therefore, we look for appropriate positions such that the prefix of

03:52.540 --> 03:57.840
the pattern is a suffix of the part of the pattern to the left of the

03:57.840 --> 03:59.880
current position in the text.

04:00.900 --> 04:04.040
This was the main idea of Knuth, Morris and Pratt.

04:04.700 --> 04:10.620
Then we also looked at the algorithm by Boyer and Moore.

04:11.260 --> 04:15.160
Boyer and Moore had a similar idea, but looked at it in a different

04:15.160 --> 04:15.600
way.

04:16.960 --> 04:22.340
Actually, they didn't look from left to right, but from right to left.

04:22.340 --> 04:28.100
They compared the pattern backwards with the document, started with

04:28.100 --> 04:33.880
the rightmost position, and then scanned the pattern to the left.

04:35.760 --> 04:41.380
In that way, the shifting distance was computed in a different way.

04:44.000 --> 04:49.740
When we check our pattern from right to left, we know something about

04:49.740 --> 04:51.720
the suffix of our pattern.

04:52.400 --> 04:56.800
Therefore, whenever we move the pattern to the right, we have to

04:56.800 --> 04:57.980
exploit that knowledge.

04:58.960 --> 05:05.260
Therefore, we developed a heuristic for doing that.

05:05.780 --> 05:07.460
Actually, we developed two heuristics.

05:07.840 --> 05:12.460
One was the occurrence heuristic, looking for occurrences of the

05:12.460 --> 05:16.940
current pattern in the document within the pattern.

05:16.940 --> 05:20.060
That was the occurrence heuristic.

05:20.580 --> 05:26.780
The other heuristic was our match heuristic.

05:28.380 --> 05:40.020
I think I should start at this position again in our presentation.

05:41.020 --> 05:47.700
This was our definition of this match heuristic.

05:49.720 --> 05:53.380
As you can see here again...

05:53.380 --> 05:58.020
Let me just select the correct...

05:59.340 --> 06:00.660
It's again black.

06:01.260 --> 06:02.240
Strange enough.

06:02.240 --> 06:09.120
In this course, it's always a different color than in the other

06:09.120 --> 06:09.540
course.

06:10.120 --> 06:10.860
Very strange.

06:12.780 --> 06:15.220
Here we have this figure.

06:16.440 --> 06:22.300
In this figure here, we see that this suffix of our pattern that we

06:22.300 --> 06:31.300
have scanned through already certainly should reoccur as a prefix of

06:31.300 --> 06:35.220
the pattern to the right of the new position of the pattern that is

06:35.220 --> 06:37.560
positioned at that place.

06:38.520 --> 06:42.600
These two red areas should be identical.

06:43.240 --> 06:48.640
Otherwise, it wouldn't be a reasonable shifting of the pattern because

06:48.640 --> 06:52.360
there would be mismatches that we know of beforehand.

06:54.160 --> 07:01.140
There was a simple example, banana, where this word ana reoccurs here

07:01.140 --> 07:04.380
with a different letter to the left of it.

07:04.460 --> 07:06.260
Here there is an N, there is a B.

07:09.200 --> 07:12.460
This was an example of a periodic pattern.

07:15.800 --> 07:20.000
Here we have a way of computing shifting distances.

07:20.000 --> 07:28.260
I mentioned that sometimes some confusion might arise by statements in

07:28.260 --> 07:28.660
the literature.

07:29.300 --> 07:32.660
There is a statement in a book that I recommended.

07:35.320 --> 07:39.100
Otherwise, this book is really good, but here is a statement which I

07:39.100 --> 07:41.060
just cannot follow.

07:41.060 --> 07:49.780
They state that this match heuristic is not effective since periodic

07:49.780 --> 07:50.700
words are rare.

07:51.360 --> 07:56.900
But if you look at the definition, what happens if there is no such

07:56.900 --> 07:58.660
periodicity in the word?

07:58.820 --> 08:03.080
This is a periodicity because some part of the pattern, the suffix,

08:03.240 --> 08:05.340
reoccurs at some other position in the pattern.

08:06.140 --> 08:10.540
This is something which we would call some kind of periodicity because

08:10.540 --> 08:13.320
a certain part reoccurs within the pattern.

08:14.060 --> 08:17.920
But what if this suffix does not reoccur at all?

08:19.100 --> 08:21.800
What would be the value of S?

08:22.240 --> 08:24.880
S is the shifting distance.

08:24.880 --> 08:32.680
So J minus S is the new position of the pattern that will be placed at

08:32.680 --> 08:37.500
this previous position here, which I indicated here by Di.

08:39.660 --> 08:47.620
Do you have an idea what S would be if there is no such reoccurring

08:47.620 --> 08:48.300
suffix?

08:48.840 --> 08:50.980
This suffix does not reoccur in the pattern.

08:52.880 --> 08:53.400
Nadir?

08:54.700 --> 08:54.960
Yeah?

08:58.520 --> 08:59.440
How many positions?

09:01.720 --> 09:06.940
To the leftmost position, to this one here.

09:07.620 --> 09:10.220
To take that one here, this position.

09:15.680 --> 09:19.160
But if it would shift, but if it would place this position here, this

09:19.160 --> 09:27.560
initial leftmost position underneath the Di, then, and you know that

09:27.560 --> 09:34.300
this suffix doesn't reoccur here, then this wouldn't be shifted far

09:34.300 --> 09:34.640
enough.

09:36.900 --> 09:37.500
Yeah?

09:38.120 --> 09:41.940
Because you actually know that this suffix does not reoccur.

09:43.000 --> 09:44.080
So this was a good guess.

09:44.160 --> 09:48.100
It is a large distance, but the distance can actually be larger than

09:48.100 --> 09:48.560
the J.

09:48.560 --> 09:55.120
And you see that in here, in this part of our definition.

09:56.500 --> 10:02.580
Here we have the requirement that all the letters to the right of J

10:02.580 --> 10:07.100
should reoccur starting from position S.

10:08.100 --> 10:15.040
And if those letters don't reoccur, well then, S has to be larger than

10:15.040 --> 10:18.920
K for all the K in this range here.

10:19.800 --> 10:24.200
If it has to be larger than K for all the K in this range, it means

10:24.200 --> 10:27.300
the value must be M.

10:29.340 --> 10:29.460
Yeah?

10:30.390 --> 10:32.400
Like S greater or equal to K.

10:33.840 --> 10:40.540
To make this true, S must be greater or equal to K, or the letters to

10:40.540 --> 10:42.540
the right should be the same.

10:43.320 --> 10:48.240
So in this case, if the suffix does not reoccur, the shifting distance

10:48.240 --> 10:49.440
S is actually M.

10:50.920 --> 10:53.820
And that is the maximum possible shifting distance.

10:53.820 --> 10:58.800
Which is what we actually would like to get, because in that case we

10:58.800 --> 11:05.560
actually don't look again at places in our document.

11:05.740 --> 11:12.180
Actually, maybe we scan only part of our document, or compare only

11:12.180 --> 11:13.640
part of our document to the pattern.

11:13.820 --> 11:17.860
Because we can determine by just looking at a few positions that the

11:17.860 --> 11:22.160
pattern is not occurring in this document.

11:22.160 --> 11:28.320
Okay, so the opposite of what was stated there is true.

11:28.720 --> 11:33.740
For aperiodic words, we get the most from that heuristic.

11:34.700 --> 11:39.820
Okay, what about the preprocessing time for computing these heuristic

11:39.820 --> 11:40.520
values?

11:40.520 --> 11:46.320
When we look at the tables delta, delta was the table for computing

11:46.320 --> 11:52.780
the distances where a certain letter from the alphabet occurs in the

11:52.780 --> 11:53.520
pattern.

11:54.120 --> 11:56.000
So we computed this table delta.

11:56.280 --> 11:58.460
There was a variance, delta prime.

11:58.960 --> 12:04.900
We also considered the position in the pattern.

12:05.940 --> 12:11.760
And then we have here this sigma h table, which we have to compute.

12:13.120 --> 12:20.160
And this can be done, as is easily seen really, by just-in-time order

12:20.160 --> 12:20.960
of m plus a.

12:21.040 --> 12:27.040
We just have to check through our pattern and list all the occurrences

12:27.040 --> 12:28.560
of symbols.

12:28.560 --> 12:33.240
The first occurrence is of symbols scanning from the right to the

12:33.240 --> 12:33.500
left.

12:33.620 --> 12:35.680
Then we have our tables delta and delta prime.

12:36.960 --> 12:48.320
And sigma h is also easily computed, so it's no really large effort to

12:48.320 --> 12:50.180
compute these heuristics.

12:51.020 --> 12:54.560
And as soon as you have these tables, you can start looking for a

12:54.560 --> 12:57.080
pattern in any document.

12:57.080 --> 13:02.200
There are improved versions of the Boyamur algorithm, which lead to

13:02.200 --> 13:06.880
even better behavior on the average.

13:07.140 --> 13:13.780
So the average performance here is sublinear, actually order of n over

13:13.780 --> 13:15.260
m, which is quite good.

13:16.160 --> 13:20.900
And there actually are variants of the Boyamur algorithm where you

13:20.900 --> 13:29.360
don't have to use this scanning from right to left, but you can also

13:29.360 --> 13:32.960
scan from left to right and use the same ideas and get the same

13:32.960 --> 13:33.880
sublinear performance.

13:34.060 --> 13:36.920
So there are variants of that, but we don't have to look into all of

13:36.920 --> 13:37.760
these things.

13:37.760 --> 13:41.680
One interesting problem is that usually you don't want to search just

13:41.680 --> 13:45.840
for one term within a document, but for several terms or for several

13:45.840 --> 13:46.320
patterns.

13:47.240 --> 13:51.500
And in that case, what you certainly could do in a naive approach

13:51.500 --> 13:59.880
would be to just perform k searches for k terms, for k words or

13:59.880 --> 14:00.300
patterns.

14:00.300 --> 14:07.420
But if you are a bit more clever, you would just search concurrently

14:07.420 --> 14:08.480
for all these patterns.

14:08.760 --> 14:14.580
It may be that there are some coincidences of characters within these

14:14.580 --> 14:20.640
patterns, and then you can use this information in order to build a

14:20.640 --> 14:26.580
more sophisticated algorithm or automaton to perform this combined

14:26.580 --> 14:27.840
pattern matching.

14:29.280 --> 14:35.480
And certainly this leads to larger tables or to more sophisticated

14:35.480 --> 14:42.520
algorithms, but the time for scanning through the document is almost

14:42.520 --> 14:45.600
the same time as for looking up a single pattern.

14:45.740 --> 14:51.420
So it's just looking at different paths in that algorithm.

14:51.420 --> 14:56.900
There are several versions of such algorithms.

14:57.020 --> 14:58.260
Yeah, there's a question.

15:02.600 --> 15:05.940
WRT means with respect to.

15:06.760 --> 15:08.180
That's a standard abbreviation.

15:08.260 --> 15:14.640
There are some standard abbreviations, so WRT means with respect to.

15:14.640 --> 15:21.720
There are other abbreviations like IFF, if and only if, and things

15:21.720 --> 15:22.280
like that.

15:23.900 --> 15:25.500
Okay, good that you asked.

15:27.120 --> 15:29.700
So there is an algorithm by Ayo and Korazic.

15:30.580 --> 15:35.500
Actually, it appeared before the algorithm of Knuth, Morris and Pratt,

15:35.640 --> 15:40.720
so previous versions of the Knuth, Morris, Pratt algorithm were known

15:40.720 --> 15:42.160
earlier than 77.

15:42.800 --> 15:50.580
And then there is an algorithm by Komenswalter who combined ideas of

15:50.580 --> 15:52.320
Ayo, Korazic and Boyamur.

15:52.680 --> 15:57.400
So this combined approach by Ayo and Korazic used the Knuth, Morris,

15:57.520 --> 16:02.620
Pratt ideas and Komenswalter combined these ideas with the, or this

16:02.620 --> 16:06.200
combined approach with the Boyamur algorithm.

16:06.200 --> 16:13.240
Okay, so there is a lot of literature on that, but we don't have to

16:13.240 --> 16:15.960
investigate that further.

16:16.180 --> 16:19.220
This is just one part of looking at full-text search.

16:20.820 --> 16:27.160
The approach that we took here was to analyze the pattern, build that

16:27.160 --> 16:33.560
table, or build the basis for that pattern matching, and then we could

16:33.560 --> 16:37.300
search arbitrary documents as fast as possible, like in sub-linear

16:37.300 --> 16:39.300
time, which really is fast.

16:40.100 --> 16:44.260
A different approach would be to analyze our documents to allow for an

16:44.260 --> 16:46.120
efficient search for arbitrary patterns.

16:47.140 --> 16:52.960
So this certainly is something which is the reasonable choice if you

16:52.960 --> 16:56.960
want to build up a large database of documents and allow for arbitrary

16:56.960 --> 16:57.460
searches.

16:57.460 --> 17:04.440
So what you do there is, as I indicated initially in this chapter, you

17:04.440 --> 17:09.460
start by looking into your document and you divide your document into

17:09.460 --> 17:09.920
words.

17:10.580 --> 17:14.180
Now, if you want to divide a document into words, you have to do

17:14.180 --> 17:18.360
something like lexical analysis and compiling, or you do something

17:18.360 --> 17:20.480
like string tokenizing in Java.

17:20.700 --> 17:23.340
There are special procedures or special methods for that.

17:23.340 --> 17:27.960
And the question is, what is actually a word?

17:28.080 --> 17:31.440
What defines the notion of a word?

17:32.260 --> 17:38.880
Just by looking at this slide, we see all kinds of words, and for us,

17:39.060 --> 17:45.720
words are something like here, patterns, or search, or for, and we

17:45.720 --> 17:52.160
notice that that's a word because we look at the separation between

17:52.160 --> 17:54.260
those strings of characters.

17:54.460 --> 17:59.040
So we see the blank spaces, and these blank spaces look like

17:59.040 --> 18:05.100
separators, and everything in between blank spaces is assumed to be a

18:05.100 --> 18:05.400
word.

18:06.520 --> 18:12.960
Now, that's formalized to have something like a string without special

18:12.960 --> 18:13.420
characters.

18:13.420 --> 18:18.180
So without a colon, without a blank, or without a hyphen.

18:20.820 --> 18:27.380
Now, these are reasonable definitions, but what do we do with

18:27.380 --> 18:29.860
something like names with hyphens?

18:30.060 --> 18:31.680
Jean-Claude, or state of the art?

18:31.760 --> 18:37.480
State of the art is that a collection of four words?

18:38.080 --> 18:39.260
Or is that one word?

18:39.540 --> 18:41.420
Or is MS-DOS?

18:41.420 --> 18:42.760
Well, that's not really a word.

18:44.720 --> 18:49.080
But something like this here, Arbeitsrecht, something that is

18:49.080 --> 18:58.160
separated by this separation mark here, just because it didn't fit

18:58.160 --> 18:58.860
into this line.

18:59.480 --> 19:01.080
But it should be the word Arbeitsrecht.

19:01.620 --> 19:04.280
Or what about this, chapter 2.3?

19:04.280 --> 19:12.100
Is that a word chapter, followed by a word 2, followed by a word 3?

19:12.700 --> 19:15.340
Or is that one word, chapter 2.3?

19:16.840 --> 19:22.180
So you see that it is not that simple to actually find out what the

19:22.180 --> 19:24.120
words in a document are.

19:24.860 --> 19:28.740
And there's even more things.

19:28.900 --> 19:32.660
Usually you would say that words shouldn't contain digits, because

19:32.660 --> 19:35.020
that would be some other numbers.

19:35.540 --> 19:37.580
So BS-2000, for example.

19:39.860 --> 19:40.840
Is that a word?

19:41.600 --> 19:44.360
Who knows what BS-2000 stands for?

19:46.700 --> 19:50.940
None of you knows what BS-2000 stands for.

19:51.820 --> 19:57.180
Shame on one of our largest computer companies in Germany.

19:58.060 --> 20:04.360
Siemens had for years, and still has, one of its major operating

20:04.360 --> 20:06.240
systems is BS-2000.

20:06.360 --> 20:07.780
Betriebssystem 2000.

20:08.780 --> 20:16.140
That was the futuristic operating system which came into life in the

20:16.140 --> 20:17.440
80s someplace.

20:18.420 --> 20:21.120
And survived decades, I would say.

20:22.260 --> 20:27.800
It's built after the operating system from IBM.

20:28.980 --> 20:32.240
Major operating system there for mainframe architecture.

20:32.500 --> 20:34.240
So it's a mainframe operating system.

20:37.280 --> 20:40.580
But you don't know what BS-2000 is.

20:40.620 --> 20:41.320
It doesn't matter.

20:41.700 --> 20:46.500
Probably you know what F-16 is, but it's not as important, I would

20:46.500 --> 20:46.720
say.

20:47.320 --> 20:49.600
F -16, is that a word?

20:49.920 --> 20:51.460
Or what about H-20?

20:53.040 --> 20:54.420
Is that a word?

20:55.260 --> 21:00.500
So you see, it's difficult to determine what a word is.

21:00.500 --> 21:03.660
So sometimes we would say these are reasonable words, sometimes we

21:03.660 --> 21:04.240
would say no.

21:05.640 --> 21:09.580
Then the question is, should we look at capital or small letters?

21:09.760 --> 21:13.340
Should we distinguish between capital and small letters?

21:13.540 --> 21:16.460
Should we treat them as being the same?

21:17.820 --> 21:21.260
Sometimes this might be important, whether you have capital letters or

21:21.260 --> 21:21.480
not.

21:21.640 --> 21:23.240
Sometimes it is completely irrelevant.

21:23.240 --> 21:29.260
If you formulate a query and you write a term in there with a starting

21:29.260 --> 21:35.800
capital letter, then would you actually like to look only for words

21:35.800 --> 21:40.160
having exactly the spelling with a capital letter?

21:40.300 --> 21:44.840
Or would you also look for occurrences of patterns having instead a

21:44.840 --> 21:45.780
small letter in there?

21:45.780 --> 21:53.260
So, again, something which you have to consider when you separate a

21:53.260 --> 21:54.580
document into words.

21:55.240 --> 22:00.220
You do this because later on you want to determine whether certain

22:00.220 --> 22:01.960
words occur in your document.

22:02.780 --> 22:09.480
And if you have not extracted the reasonable listing of words, then

22:09.480 --> 22:13.860
you won't be able to find these words in your documents.

22:13.860 --> 22:14.800
Okay.

22:16.440 --> 22:20.840
Again, what you do is, in lexical analysis, you just construct a

22:20.840 --> 22:21.580
suitable automaton.

22:22.340 --> 22:23.140
This can be done.

22:23.300 --> 22:26.640
This separation into words is actually, again, a simple finite

22:26.640 --> 22:27.200
automaton.

22:29.160 --> 22:34.800
And the second step is to filter relevant words.

22:35.200 --> 22:38.000
So the question is, which words are relevant?

22:38.000 --> 22:44.060
Well, certainly not relevant with respect to the query, but relevant

22:44.060 --> 22:49.040
with respect to what is there, or can a certain word actually have any

22:49.040 --> 22:50.820
meaning with respect to queries?

22:51.460 --> 22:55.080
So you would never look for the word are.

22:55.220 --> 22:56.720
You would not look for on.

22:56.920 --> 23:01.080
You would not look for of, or eh, and so on.

23:01.080 --> 23:05.440
So there are certain words which we call stop list words.

23:06.160 --> 23:14.060
There's a list of words which you just can delete from your document,

23:14.240 --> 23:21.140
which you don't insert into your list of words that should be

23:21.140 --> 23:25.760
considered when you process queries later on.

23:26.600 --> 23:30.580
So this stop list certainly may be integrated into lexical analysis,

23:31.180 --> 23:38.200
into the separation algorithm, by appending this algorithm to the

23:38.200 --> 23:40.000
scanning algorithm.

23:41.020 --> 23:45.260
And, for example, if you would say you want to delete something like

23:45.260 --> 23:49.400
eh, or an, or and, or in, or into, and to, you could build up an

23:49.400 --> 23:56.320
automaton like this, where you see that whenever you have an a, or an

23:56.320 --> 24:09.720
n, or an and, or an in, or you have to, or you have into, actually,

24:10.240 --> 24:17.440
these would be, so into, for example, you would follow this path, or

24:17.440 --> 24:22.560
and, you would follow this path, and whenever you get to a final state

24:22.560 --> 24:26.380
indicated by a double circle, then you have a stop word.

24:26.380 --> 24:29.680
So this would be a simple automaton, which certainly would be

24:29.680 --> 24:37.680
formulated in some type of, some type of programming language, but

24:37.680 --> 24:40.040
essentially you do something like this.

24:41.060 --> 24:46.360
So such an automaton is easily constructed to see that finite automata

24:46.360 --> 24:52.060
are used in many situations in information processing, and you just

24:52.060 --> 24:58.380
have to, or to manage a number of different states, and you have to

24:58.380 --> 25:01.760
know how to get from one state to the other, that's essentially what

25:01.760 --> 25:02.540
you have to do there.

25:03.640 --> 25:08.580
Okay, that was separating the document, looking for stop words, or

25:08.580 --> 25:13.520
deleting stop words, and another point is to reduce the words to the

25:13.520 --> 25:16.560
word stems in order to allow for more general searches.

25:17.360 --> 25:24.940
So you certainly want to make sure that, for example, here, the word

25:24.940 --> 25:30.440
stemming, yeah, if you look for, or if you have a query later on which

25:30.440 --> 25:34.900
looks for stemming, it would be good if it would also find occurrences

25:34.900 --> 25:39.420
of the word, of words where you just have the word stem and not

25:39.420 --> 25:40.120
stemming.

25:40.760 --> 25:50.140
So what you reduce is certain suffixes that are really not that

25:50.140 --> 25:56.700
relevant, so the suffix i-e-s will be replaced by just the suffix i,

25:57.520 --> 26:02.720
as, for example, the word ponies would be reduced to the word pony, or

26:02.720 --> 26:08.760
the word, or if suffix s is just replaced with nothing, it's just

26:08.760 --> 26:15.660
deleted, so cats would be reduced to cat, or the e-d ending is also

26:15.660 --> 26:23.100
deleted, so plastered would be plaster, or i-n-g would be deleted, so

26:23.100 --> 26:32.200
motoring would be replaced with motor, and like et actually would be,

26:33.840 --> 26:39.920
so this would be a double step, a second step, so for example here,

26:40.100 --> 26:48.120
conflated would be reduced to conflat by this rule here, e-d is

26:48.120 --> 27:00.500
deleted, and then you would replace this ending a-t with the ending a

27:00.500 --> 27:06.200
-t -e here, and then you would get something from conflated, you would

27:06.200 --> 27:08.720
get to conflate, and so on.

27:08.940 --> 27:15.200
Long list of possible replacements, and definitely this is very

27:15.200 --> 27:19.940
language specific, you have to do this for different languages in

27:19.940 --> 27:26.080
different ways, but in that way you would get words which are some

27:26.080 --> 27:32.180
kind of, which are called the stems of the words, and so you could, if

27:32.180 --> 27:37.160
you later on look for something like form, or formative, you would get

27:37.160 --> 27:39.780
everything that is reduced to form.

27:40.860 --> 27:44.720
Maybe you don't want that, so not every search engine will provide

27:44.720 --> 27:50.060
that type of search, but if a search engine has this, then it can

27:50.060 --> 27:55.020
perform more general searches.

27:56.800 --> 28:01.720
So, another advantage is that the number of words you have to store is

28:01.720 --> 28:09.880
actually reduced quite significantly by at least 50%, or another point

28:09.880 --> 28:17.080
is that, what I just said, the search is generalized to include more

28:17.080 --> 28:21.680
words than are actually formulated in the query.

28:23.020 --> 28:26.640
The most important point, definitely, is to build up an efficient

28:26.640 --> 28:27.480
search structure.

28:28.100 --> 28:33.800
And this building up of an efficient search structure, well, certainly

28:33.800 --> 28:43.320
is most important, because every search later on will be executed on

28:43.320 --> 28:48.060
this structure, and the efficiency of these searches, therefore,

28:48.180 --> 28:50.120
depends on the structure you have here.

28:50.420 --> 28:53.480
What you have to do is you have to build an inverted file.

28:54.410 --> 29:01.280
Here, you have your file, you construct your index, and now you have

29:01.280 --> 29:06.340
here a listing of all the words, and for every word, you need pointers

29:06.340 --> 29:10.220
to the positions in the text where these words occurred.

29:10.860 --> 29:12.020
That's an inverted file.

29:12.020 --> 29:18.540
You have a listing of the contents of a file, and you need pointers to

29:18.540 --> 29:23.120
the positions in the original file where this content actually

29:23.120 --> 29:23.600
occurred.

29:24.700 --> 29:28.780
So, for every word of the document, you have to specify where it

29:28.780 --> 29:37.160
actually occurred, within which document, at which location, and you

29:37.160 --> 29:39.040
can certainly add more information.

29:39.280 --> 29:47.080
You can add something like relevance information, weight, there may be

29:47.080 --> 29:50.380
ways of determining the weight of a word.

29:51.060 --> 29:56.260
You could write down how often it occurs, would become some indication

29:56.260 --> 30:00.100
of the relevance, so if it occurs just once, it might be not as

30:00.100 --> 30:03.600
relevant as a word that occurs 20 or 100 times in a document.

30:05.600 --> 30:11.720
And there might be more things like distances from the beginning and

30:11.720 --> 30:13.720
things like that, although you have that immediately from the

30:13.720 --> 30:14.060
location.

30:15.840 --> 30:22.960
So, possible data structures are a great number of possible data

30:22.960 --> 30:23.320
structures.

30:23.320 --> 30:27.240
One simple data structure would be a sorted list.

30:28.500 --> 30:32.100
Just make an alphabetical listing of all your words.

30:32.320 --> 30:37.000
As you see it at the end of every book, of every reasonable book, you

30:37.000 --> 30:40.200
have an index at the end, an alphabetical list of all the words

30:40.200 --> 30:43.820
occurring in the document, in the book, and you have pointers

30:43.820 --> 30:49.720
backward, back to the locations, and so that's a sorted list with

30:49.720 --> 30:50.480
random access.

30:50.640 --> 30:55.960
You can look at any position in this list, would be some possible

30:55.960 --> 30:57.160
inverted file.

30:58.640 --> 31:04.220
The advantage of that is that you have a very simple structure, and

31:04.220 --> 31:06.700
you have logarithmic access time.

31:06.800 --> 31:11.560
If you want to look for a specific word within that sorted list, you

31:11.560 --> 31:15.680
just perform binary search in order to locate that word or find out

31:15.680 --> 31:17.400
whether that word actually is in the list.

31:18.220 --> 31:23.660
So, you have logarithmic time to actually get to the word that you

31:23.660 --> 31:24.280
look at.

31:25.200 --> 31:30.060
And the problem is, though, that if you want to maintain a sorted

31:30.060 --> 31:34.700
list, then you have quite expensive administration.

31:35.180 --> 31:39.600
You have linear time for insertions and deletions, because whenever

31:39.600 --> 31:44.500
you insert something into such a list, you have to move all the

31:44.500 --> 31:51.700
entries that come after that one position further, and that certainly

31:51.700 --> 31:53.980
is linear time for insertions.

31:54.160 --> 31:58.040
The same is true for deletions, and so that's an expensive data

31:58.040 --> 32:04.600
structure whenever you have dynamic information.

32:04.600 --> 32:13.280
Information about contents in the World Wide Web is information that

32:13.280 --> 32:19.660
is very dynamically changing, so probably a sorted list for the random

32:19.660 --> 32:22.460
access wouldn't be that reasonable.

32:23.520 --> 32:29.880
Then there is something, some of the standard choices if you look for

32:29.880 --> 32:32.900
efficient data structures, is a tree structure.

32:33.560 --> 32:42.220
There are many possible implementations of trees, or of inverted files

32:42.220 --> 32:42.960
as trees.

32:43.640 --> 32:49.500
One standard structure would be a B-tree, which is a balanced search

32:49.500 --> 32:58.540
tree having several entries per node, not just one entry per node, but

32:58.540 --> 33:07.420
usually quite a lot of entries, usually about half a memory page in

33:07.420 --> 33:08.780
one node.

33:09.660 --> 33:14.920
And this is especially useful for large search structures on hard

33:14.920 --> 33:15.280
disks.

33:15.480 --> 33:22.920
If you need secondary memory for storing your index, which is the case

33:22.920 --> 33:26.780
for very large search structures, and if we talk about search engines,

33:27.400 --> 33:33.740
we talk about very large search spaces, so here B-tree might be a

33:33.740 --> 33:35.320
reasonable data structure.

33:35.540 --> 33:40.540
What I didn't list here is something like hash tables.

33:41.480 --> 33:50.720
You all know from Informatics 2 or from other introductory courses to

33:50.720 --> 33:55.200
Informatics, something about hash tables would also be a reasonable

33:55.200 --> 34:04.900
data structure with constant access time on the average, but if it's

34:04.900 --> 34:10.680
very large or very heavily growing or shrinking, this might lead to

34:10.680 --> 34:10.940
problems.

34:11.100 --> 34:14.720
So hash tables actually are also used to some extent in search

34:14.720 --> 34:15.120
engines.

34:15.120 --> 34:18.600
What I want to concentrate on is tree structures.

34:19.240 --> 34:23.340
I want to start by telling you something about B-trees, and then I

34:23.340 --> 34:27.420
will show you something about so-called tris.

34:28.600 --> 34:37.140
Here, people are not really consistent in the way they pronounce this.

34:37.920 --> 34:41.420
I should...

34:42.000 --> 34:43.060
No, this doesn't work.

34:43.640 --> 34:44.700
This doesn't work.

34:45.280 --> 34:45.940
Oh, it does work.

34:46.220 --> 34:46.440
Perfect.

34:47.220 --> 34:52.200
So I just want to erase this.

34:54.080 --> 35:03.260
So tris or trees, this pronunciation is not a wrong pronunciation or

35:03.260 --> 35:05.140
wrong spelling, I mean.

35:05.440 --> 35:11.540
This spelling is not a wrong spelling of a tree, T-R-E-E, but this is

35:11.540 --> 35:15.520
some artificial word coming from the word retrieval.

35:15.520 --> 35:23.360
So it has to do with retrieving information and the similarity to the

35:23.360 --> 35:24.980
trees with E-E.

35:25.760 --> 35:27.640
Well, it's just used here.

35:28.480 --> 35:33.420
Okay, so some people pronounce it tri, some people pronounce it tree.

35:35.580 --> 35:41.200
It doesn't matter how you do it, there are some variants in the

35:41.200 --> 35:42.380
English language.

35:42.660 --> 35:44.520
Okay, let us start with B-trees.

35:44.520 --> 35:47.900
B here actually doesn't stand for balance, but for Bayer.

35:48.640 --> 35:57.500
Bayer is a German scientist who worked at Munich, and he designed

35:57.500 --> 35:58.660
these B-trees.

35:59.300 --> 36:05.100
And the B-trees are, as I said, a structure that is especially useful

36:05.100 --> 36:11.840
in very large search spaces where you use secondary memory to store

36:11.840 --> 36:12.420
your data.

36:13.380 --> 36:20.100
So we want to build a balance tree supporting an efficient search for

36:20.100 --> 36:22.460
keywords from an ordered set.

36:22.700 --> 36:28.160
That means we can compare two words and say one is larger than the

36:28.160 --> 36:29.240
other in some ordering.

36:30.260 --> 36:36.540
And in the B-tree, we have a certain number of properties.

36:36.540 --> 36:39.340
So a B-tree has a degree m.

36:40.460 --> 36:42.340
Here's an example for m equals 3.

36:44.280 --> 36:47.840
One important property is this balancing property.

36:48.060 --> 36:52.720
All the leaves of our B-tree have the same depth.

36:53.140 --> 36:58.800
That means the number of nodes on the path from the root to the leaf

36:58.800 --> 36:59.780
is always the same.

37:00.420 --> 37:01.980
That way it is balanced.

37:01.980 --> 37:06.800
Another point where it should be balanced is that the number of

37:06.800 --> 37:13.840
entries of keys in all these different sub-trees should be, to some

37:13.840 --> 37:18.140
part, the same or not too different.

37:19.720 --> 37:29.680
So every inner node are nodes that are not leaves or not a root.

37:30.400 --> 37:36.800
Every inner node except for the root has at least m over 2 successors.

37:36.940 --> 37:44.660
m over 2 in these upper square brackets is the next larger integer.

37:46.280 --> 37:52.340
So that means if we have 3, that means that every inner node has at

37:52.340 --> 37:54.040
least 2 successors.

37:54.040 --> 38:01.060
m over 2, 3 over 1 is 1.5, so it's 2 successors.

38:01.700 --> 38:03.020
At least 2 successors.

38:03.640 --> 38:05.460
The root has at least 2 successors.

38:05.540 --> 38:08.700
But, for example, if we would have a larger B-tree, as I said, usually

38:08.700 --> 38:12.640
we would have something like half a memory page size.

38:12.760 --> 38:20.600
So usually m would be something like 512 or so, or 511 or 13, and so

38:20.600 --> 38:20.800
on.

38:20.800 --> 38:28.080
And so then you would see that most of those nodes in the tree would

38:28.080 --> 38:34.040
have many successors, and only the root may have less than m over 2

38:34.040 --> 38:34.840
successors.

38:35.380 --> 38:37.700
The root has at least 2 successors.

38:38.020 --> 38:40.380
In this example you see that there are 3 successors.

38:40.380 --> 38:47.060
Every node has at most m successors, so in this example here for m

38:47.060 --> 38:52.300
equals 3, there will never be more than 3 successors to a node.

38:53.060 --> 38:58.160
And finally here we have every node with k successors contains k minus

38:58.160 --> 38:58.840
1 keys.

38:58.840 --> 39:07.120
So, let's fill this tree here with some reasonable content, so assume

39:07.120 --> 39:10.280
that we just store letters, not words.

39:10.540 --> 39:14.980
I wouldn't be able to write down many words in here, so I just store

39:14.980 --> 39:17.160
some letters, for example.

39:17.680 --> 39:22.000
Let's assume that every dot here indicates that there should be a key.

39:22.000 --> 39:27.960
Let's start with an A here, maybe we have... this could be just

39:27.960 --> 39:29.200
some...

39:29.200 --> 39:31.420
for example, a G here.

39:31.920 --> 39:41.100
We could have, let's say, a K over there, and maybe some R over here,

39:41.480 --> 39:45.340
maybe some... or could be some M.

39:45.340 --> 39:50.380
And here we could have a T and an X.

39:52.020 --> 39:59.000
So, this indicates the way we actually build up such a tree, so if you

39:59.000 --> 40:04.760
search for a certain letter, for example, if you would search for the

40:04.760 --> 40:12.080
letter G, then we go into the root node, we see G is less than K, so

40:12.080 --> 40:14.600
we have to follow this link here.

40:15.320 --> 40:19.380
Then we get to a node, we look whether G is in that node, this is the

40:19.380 --> 40:21.700
case, so we are done.

40:22.140 --> 40:25.920
If we would have looked for the occurrence of the letter D, we would

40:25.920 --> 40:33.380
also have entered this node here, and we would have seen, okay, D is

40:33.380 --> 40:38.980
larger than A, but the link immediately... okay, it's larger than A,

40:39.060 --> 40:43.080
it's smaller than G, so we would have to follow this link here, but

40:43.080 --> 40:48.900
this link leads to a leaf, which is empty, the D is not stored in this

40:48.900 --> 40:49.560
structure.

40:49.560 --> 40:54.480
If you want to insert something, for example, we want to insert the

40:54.480 --> 41:02.260
letter U into this structure, how would we do that?

41:03.160 --> 41:10.560
We would look into our root, there's no U in this root, so U is larger

41:10.560 --> 41:16.140
than R, we have to follow this link here, go to this node there, we

41:16.140 --> 41:23.420
have the letter T, U is larger than T, it is smaller than X, so we

41:23.420 --> 41:29.340
have to follow this link, there's a leaf, so U is not in the index, or

41:29.340 --> 41:32.520
not in this P-tree, we would have to insert U.

41:32.520 --> 41:42.460
If we insert U into this node here, we have three keys within that

41:42.460 --> 41:48.880
node, that means every node with K successors contains K-1 keys, here

41:48.880 --> 41:53.680
we have three keys, so we would need four successors, but four

41:53.680 --> 41:59.180
successors is not allowed, because we have property B4, which states

41:59.180 --> 42:01.560
that we have at most M successors.

42:02.040 --> 42:04.720
So, here we have to do something.

42:05.860 --> 42:12.920
What we do is, we just split that node, and move the key that is in

42:12.920 --> 42:16.040
the middle position up to the top.

42:16.040 --> 42:17.000
So what do we do?

42:17.220 --> 42:30.920
We delete this U, and we just move, well, U, for a short moment, we

42:30.920 --> 42:39.220
construct this node U, having two successors, one successor, another

42:39.220 --> 42:45.300
successor, and this successor here has, here we have the node

42:45.300 --> 42:52.740
containing just one key T, and one additional leaf we need here, and

42:52.740 --> 43:00.380
another node would be just this X, having two successor leaves.

43:00.380 --> 43:08.500
But certainly this U cannot be just here, just sit here, just besides

43:08.500 --> 43:09.500
the structure.

43:09.700 --> 43:15.220
So the U actually is moved up, so it's moved up, it is moved into this

43:15.220 --> 43:21.280
next node on the path to the root.

43:21.280 --> 43:26.740
So the U actually does not sit there, but the U has to be inserted

43:26.740 --> 43:31.240
into this node that is the parent node of the node where we actually

43:31.240 --> 43:32.000
inserted U.

43:32.000 --> 43:41.380
So this node that we moved out here now, which contains the median

43:41.380 --> 43:48.180
value of these keys here in that node, this U is inserted into the

43:48.180 --> 43:51.480
node, like this parent node.

43:51.480 --> 43:55.740
Now in there, U is larger than R, so U is entered at that position.

43:56.320 --> 44:07.500
Again, we have a situation, so this here again has to be deleted.

44:07.500 --> 44:09.440
Oops, doesn't work.

44:10.400 --> 44:20.580
So what we have here now is one, this R would lead, oops, this was

44:20.580 --> 44:26.780
wrong, this R would, well we need something different here.

44:26.780 --> 44:33.940
So we have now these two nodes, T and X, having two successors.

44:35.120 --> 44:39.380
Here we have a node which has two keys, which is too much, so we have

44:39.380 --> 44:46.560
to actually write R here on top, take the median value of those keys,

44:46.700 --> 44:48.780
we delete our R again.

44:48.780 --> 44:57.320
Then we have our new root R, we have a new node containing just the K,

44:58.260 --> 45:02.380
and a link to this M.

45:02.940 --> 45:10.740
We have the R, which has a link to this node here with a U.

45:12.540 --> 45:16.680
And a link to this K.

45:17.620 --> 45:20.540
And where does this U lead to?

45:20.920 --> 45:24.520
Certainly to that node and to that node.

45:24.520 --> 45:29.240
I hope you can see a bit of the structure.

45:29.400 --> 45:38.140
So what we did here is we inserted a new entry into this tree, and

45:38.140 --> 45:50.260
this resulted in a splitting of a large node at the lowest level above

45:50.260 --> 45:51.240
the leaves.

45:51.240 --> 45:59.280
We had to split that node, that led to a splitting of the parent node,

45:59.380 --> 46:01.740
because that was large already.

46:02.500 --> 46:07.580
And so we got a new root node, that means here actually the depth of

46:07.580 --> 46:09.400
the tree was increased by one.

46:09.400 --> 46:14.520
So here we see how we actually insert into a B-tree.

46:15.500 --> 46:22.020
In a similar way, we would delete entries from such a B-tree, and what

46:22.020 --> 46:27.680
you notice if you look at these algorithms is that the depth of the

46:27.680 --> 46:29.780
tree is always...

46:29.780 --> 46:34.260
the properties here, B1 to B5, are always maintained.

46:34.400 --> 46:37.940
That means we always have all the leaves have the same depth.

46:38.860 --> 46:42.820
It will never happen that this tree gets unbalanced.

46:42.820 --> 46:48.780
Certainly it is in some way slightly unbalanced because there will be

46:48.780 --> 46:55.160
some nodes containing many keys and some nodes containing fewer keys,

46:55.860 --> 47:04.340
but because of this property here, we know that every inner node has

47:04.340 --> 47:08.660
at least about half the maximum number of nodes.

47:08.660 --> 47:15.640
So the number of keys in a node is something in between M half and M,

47:16.100 --> 47:19.720
or M half minus one and M minus one.

47:21.120 --> 47:24.140
And why do we actually do things like that?

47:24.340 --> 47:29.020
Why do you put more than one entry, more than one key into a node?

47:29.980 --> 47:37.880
The reason for that is that accessing a node means, in this setting,

47:38.620 --> 47:41.880
you have access to secondary memory.

47:42.580 --> 47:46.080
If you access secondary memory, it means that it needs time.

47:46.080 --> 47:55.060
Access to hard disk is much slower than access to main memory or

47:55.060 --> 47:56.000
access to cache.

47:56.980 --> 48:02.440
And if you have to move a new page into main memory, it takes so long

48:02.440 --> 48:07.400
that you want to minimize the number of these accesses.

48:07.400 --> 48:15.160
And therefore, you put about one page of data into one node.

48:15.260 --> 48:21.380
That means when you have to access one of those items, you always get

48:21.380 --> 48:25.900
a complete page with reasonable information from your secondary

48:25.900 --> 48:26.400
memory.

48:26.400 --> 48:31.600
And that means the number of disk accesses in order to find actually

48:31.600 --> 48:35.760
an entry in this search structure is minimized.

48:36.220 --> 48:39.980
That's the major purpose of this data structure.

48:39.980 --> 48:46.740
So B-trees are one of the standard data structures in the design of

48:46.740 --> 48:52.860
large data, of large search structures, of large indices.

48:53.960 --> 48:56.600
But it's not the only thing that you look at.

48:56.600 --> 49:02.620
What you have to do is you also have to optimize the data structures

49:02.620 --> 49:07.940
that are actually maintained in main memory, and so we have to look at

49:07.940 --> 49:08.740
a few more things.

49:08.740 --> 49:13.140
So here, what you actually get is that you can search, insert, and

49:13.140 --> 49:19.880
delete in time order of lock M of N, where N is the number of keys,

49:20.100 --> 49:23.200
and M here is the maximum degree of the nodes.

49:23.200 --> 49:33.000
So essentially, logarithmic access time, and we saw that in the sorted

49:33.000 --> 49:38.040
lists, we had also logarithmic time for searching, but we had linear

49:38.040 --> 49:39.940
time for inserting and deleting.

49:39.940 --> 49:48.580
So that was much worse, but the actual search time is similar.

49:49.500 --> 49:55.840
But in a sorted list, you cannot be sure that you have the same small

49:55.840 --> 50:01.100
number of accesses to a disk as you have in a B-tree.

50:01.100 --> 50:07.340
So, as I said, in practice, M is about half the memory page size, in

50:07.340 --> 50:11.520
order to optimize the number of accesses to secondary memory.

50:11.520 --> 50:18.020
And, okay, this is definite, all the keys of a node fit into one page

50:18.020 --> 50:25.220
of memory, and now we get to this different structure, trees or tries,

50:25.820 --> 50:29.700
whatever you want to pronounce this, data structure for the efficient

50:29.700 --> 50:34.660
retrieval of words by character-based comparisons.

50:35.260 --> 50:35.820
Okay.

50:39.460 --> 50:45.440
Actually, what we have, when we talked about this looking for several

50:45.440 --> 50:50.240
words at the same time, this algorithm by Aho and Korazic, which I

50:50.240 --> 50:54.540
didn't present to you in detail, or not even gave you any detail,

50:55.120 --> 50:58.480
there, what you actually build up is something which is called also a

50:58.480 --> 50:58.860
try.

50:58.860 --> 51:05.900
And here, this try is a tree with a branching degree that is less than

51:05.900 --> 51:11.100
or equal to the size of the alphabet, because the branching occurs

51:11.100 --> 51:15.680
with respect to the occurrence of characters in a word.

51:16.880 --> 51:23.760
So, simple example, symbols that are some words that look quite

51:23.760 --> 51:30.040
similar, and if you want to build up a search structure for this set

51:30.040 --> 51:35.440
of words, so assume we look for a set of words, could also be a

51:35.440 --> 51:38.720
sequence of words, but here we just interpret this as a set of words,

51:38.720 --> 51:40.900
wer, weiss, wo, wir, sind.

51:41.260 --> 51:44.220
Now, how could you build a search structure for that?

51:44.980 --> 51:50.660
You look, like you start at a root, and you look at the first

51:50.660 --> 51:52.040
character of the words.

51:52.140 --> 51:57.320
The first character of the words are either a W or an S.

51:57.320 --> 52:04.480
If you detect that the first character is an S, and you are only

52:04.480 --> 52:09.720
looking for words within that set here, then you are done, because you

52:09.720 --> 52:17.920
know that, so if it is an S, actually I could have deleted that note

52:17.920 --> 52:23.180
here, don't need that note, you know immediately that it is definitely

52:23.180 --> 52:26.440
the word sind, or it could be the word sind.

52:27.320 --> 52:31.200
I could compare with that word, because you know the only word

52:31.200 --> 52:34.740
starting with an S in that list here is the word sind.

52:35.640 --> 52:38.820
If it starts with a W, there are several choices.

52:39.140 --> 52:45.480
The second letter of the words that are in this structure are either E

52:45.480 --> 52:47.460
or I or O.

52:47.460 --> 52:54.440
If it is an O, you already have one word, namely wo, which occurs in

52:54.440 --> 52:56.140
our set.

52:56.620 --> 53:02.840
If it is an I, you know there is only one word having W followed by an

53:02.840 --> 53:04.120
I, and so on.

53:04.120 --> 53:09.920
And so you get this very simple search structure for the words in this

53:09.920 --> 53:17.540
set, which is a structure reasonable for this character-based

53:17.540 --> 53:18.080
comparison.

53:18.420 --> 53:26.700
You compare your query term, or you scan your query term, and you know

53:26.700 --> 53:29.800
something about the structure of your document.

53:29.800 --> 53:33.820
The structure of your document is actually incorporated into that

53:33.820 --> 53:37.220
simple tree or trie.

53:38.220 --> 53:49.800
Something which is usually done is to look at all semi-infinite

53:49.800 --> 53:52.400
strings, the so-called systrings.

53:52.400 --> 54:01.180
That means, if you have such a sequence, like you could look at this

54:01.180 --> 54:06.640
sentence here, at this phrase, all semi-infinite strings, systrings,

54:06.760 --> 54:09.980
suffixes of a character sequence, and you look for occurrences of

54:09.980 --> 54:17.820
words in that sequence of letters, you could also look for, well, all

54:17.820 --> 54:23.700
words starting with all, all words starting with infinite, or all

54:23.700 --> 54:28.620
words starting with suffixes, or with of, or a, and so on.

54:29.540 --> 54:34.980
And in that way, you would look at the beginnings of words, at the

54:34.980 --> 54:40.780
starting positions of words within your document, and you don't worry

54:40.780 --> 54:46.320
that much about the things that follow those starting characters,

54:46.320 --> 54:54.060
because in your matching process, you want to compare letters as long

54:54.060 --> 54:58.900
as it is not clear which position, actually, you are looking at.

54:59.160 --> 55:01.880
As soon as you know the position that you have to look at in your

55:01.880 --> 55:06.980
document, you will actually compare the words, and then you have

55:06.980 --> 55:11.460
looked, essentially, at this semi-infinite string, a suffix of a

55:11.460 --> 55:12.240
character sequence.

55:12.240 --> 55:13.580
So that's what we do.

55:14.240 --> 55:20.440
So, for example, if we would look for the first eight suffixes of this

55:20.440 --> 55:27.420
simple sequence of zeros and ones, we could build up a search tree, or

55:27.420 --> 55:31.760
a retrieval tree, like this, here.

55:31.760 --> 55:36.080
The first suffix would be just 01100, and so on.

55:37.880 --> 55:46.360
But what we have here is a structure leading to leaves, at those

55:46.360 --> 55:51.780
positions where it is definitely clear where this semi-infinite string

55:51.780 --> 55:52.680
actually started.

55:52.680 --> 55:59.080
So the letters, or the digits, or these numbers, these numbers here in

55:59.080 --> 56:04.480
the leaves, indicate the starting positions of these semi-infinite

56:04.480 --> 56:10.580
strings, or suffixes, within that sequence.

56:10.580 --> 56:19.480
So here it says 1, because the sequence 011, like among those first

56:19.480 --> 56:27.140
eight strings, 1, 2, 3, 4, 5, 6, 7, 8, so we only go to this position.

56:27.140 --> 56:36.520
So if you look at the possible starting or prefixes of sequences, you

56:36.520 --> 56:41.620
see that 011 occurs only once, namely at this initial position.

56:41.620 --> 56:57.160
Or here it says 8, because 00101 is only one sequence starting with

56:57.160 --> 57:04.800
that sequence, so 00101 starts at position 8, 00101.

57:04.800 --> 57:12.760
There are other sequences starting with 0, like those at 5 and 1, or

57:12.760 --> 57:18.620
00s, those starting at position 7, 4, and so on.

57:19.620 --> 57:25.440
So you scan through the term that you're actually interested in, until

57:25.440 --> 57:31.780
you finally get to a position where you can determine the location of

57:31.780 --> 57:34.940
that query term within your document.

57:34.940 --> 57:42.800
That location is indicated as an entry in the leaf of that tree.

57:43.320 --> 57:49.420
So if you have in your query term the sequence 1000, you know that

57:49.420 --> 57:55.660
that query term matches at position 6 here, 1000.

57:55.660 --> 57:57.940
So that would start at that position.

57:58.680 --> 58:01.440
Then you can do whatever you like with that information.

58:02.320 --> 58:13.100
So this would lead to the construction of trees, just by looking at

58:13.100 --> 58:18.600
the structure that is embedded in such a sequence of characters.

58:19.460 --> 58:25.300
Now you could, so definitely here the different systrings are

58:25.300 --> 58:28.420
identified by the starting position, that's obvious now.

58:28.900 --> 58:31.440
Now you could try to improve what you have done here.

58:32.700 --> 58:38.280
So you should try to make this as small as possible, you should delete

58:38.280 --> 58:39.560
redundant comparisons.

58:39.560 --> 58:41.560
What are redundant comparisons?

58:42.220 --> 58:46.960
For example, here you have a non-branching path.

58:47.760 --> 58:54.160
Here you have a comparison of 1 and 0, and you know there's no

58:54.160 --> 58:55.000
branching here.

58:55.000 --> 58:58.260
There's only one possible continuation.

58:59.100 --> 59:08.400
If you have 001, then all the systrings in your sequence will continue

59:08.400 --> 59:09.120
with a 0.

59:10.100 --> 59:14.600
But that means if you have detected the 1, you could immediately go

59:14.600 --> 59:19.680
through that position and check whether the next position then

59:19.680 --> 59:21.160
contains a 0 or 1.

59:21.160 --> 59:23.000
The same is true here.

59:23.640 --> 59:29.180
If you have a starting with a 1, and you know that the next symbol is

59:29.180 --> 59:35.540
a 0, then you should not actually stop there, but you should go to the

59:35.540 --> 59:38.660
next position and check whether there is a difference.

59:38.660 --> 59:45.720
So you could delete certain nodes from your structure,

59:50.920 --> 01:00:00.280
and that way you would get a reduced data structure, but you certainly

01:00:00.280 --> 01:00:04.100
have to remember what you have done.

01:00:04.100 --> 01:00:12.440
And you remember that by indicating in your nodes the position of the

01:00:12.440 --> 01:00:15.440
query term that you actually look at.

01:00:15.880 --> 01:00:19.860
So, the initial position you look at is 1 here.

01:00:19.860 --> 01:00:25.660
The next position, for example, here, if you have seen a 0, would be

01:00:25.660 --> 01:00:26.340
position 2.

01:00:27.100 --> 01:00:30.580
The next position then, in both cases, would be a 3.

01:00:31.620 --> 01:00:35.860
But here, if you have seen a 1 in that case, the next position in your

01:00:35.860 --> 01:00:39.000
query term would be the position 5.

01:00:39.000 --> 01:00:47.200
Because position 4 is not relevant for determining the location of the

01:00:47.200 --> 01:00:48.320
term within the document.

01:00:49.100 --> 01:00:50.320
The same is true here.

01:00:50.900 --> 01:00:56.540
There you would scan positions 1, 2, and then, after that, position 4.

01:00:56.540 --> 01:00:59.960
So you need a bit more information in your data structure.

01:01:00.180 --> 01:01:03.720
You need the information on the current position in your term that you

01:01:03.720 --> 01:01:04.580
have to look at.

01:01:05.340 --> 01:01:13.140
But you gain something because the structure actually gets a bit

01:01:13.140 --> 01:01:18.940
smaller, which is not that visible in that example here because we

01:01:18.940 --> 01:01:21.840
only deleted here two nodes.

01:01:21.840 --> 01:01:29.700
But if you have standard words from normal text, then there are

01:01:29.700 --> 01:01:34.680
certain sequences that will always occur, and so you only have to look

01:01:34.680 --> 01:01:38.620
at relevant positions into your query terms in order to determine

01:01:38.620 --> 01:01:39.940
where that actually occurs.

01:01:39.940 --> 01:01:41.280
Okay.

01:01:42.520 --> 01:01:49.340
These trees that are reduced suffix trees are sometimes called

01:01:49.340 --> 01:01:50.320
Patricia trees.

01:01:51.540 --> 01:01:55.720
I don't know the exact reason why this was called Patricia tree.

01:01:56.340 --> 01:02:03.200
Maybe that the persons who actually designed that had some favorite

01:02:03.200 --> 01:02:04.120
name, Patricia.

01:02:04.120 --> 01:02:09.980
But you always have to argue that it has something to do with what you

01:02:09.980 --> 01:02:10.240
do.

01:02:10.700 --> 01:02:14.560
So this is a practical algorithm to retrieve information coded in

01:02:14.560 --> 01:02:15.420
alphanumerical.

01:02:17.380 --> 01:02:21.120
Definitely immediately leads to the abbreviation Patricia.

01:02:21.120 --> 01:02:29.500
So this is a Patricia tree, and this was abbreviated to Pat tree by a

01:02:29.500 --> 01:02:37.020
fellow called Goni, and Goni, like he had, I will come to that a bit

01:02:37.020 --> 01:02:39.320
later, he worked on a larger project.

01:02:39.800 --> 01:02:42.740
For that project, he designed these efficient data structures.

01:02:43.540 --> 01:02:48.640
So Patricia tree is a special suffix tree containing all the suffixes

01:02:48.640 --> 01:02:49.540
of the text.

01:02:50.920 --> 01:02:55.340
Normally, like what I just said, what I just did with respect to this

01:02:55.340 --> 01:03:01.860
01 sequence was looking at all the possible starting positions of

01:03:01.860 --> 01:03:04.360
suffixes in that sequence of characters.

01:03:04.360 --> 01:03:11.420
But if you look into documents for the occurrence of words, then the

01:03:11.420 --> 01:03:16.500
only reasonable starting positions of suffixes are the starting

01:03:16.500 --> 01:03:21.340
positions of words and not starting positions within some words.

01:03:21.740 --> 01:03:27.240
So we would just look at starting positions that are beginnings of

01:03:27.240 --> 01:03:27.880
words.

01:03:30.440 --> 01:03:40.160
Another artificial example is the phrase... I have to make these

01:03:40.160 --> 01:03:46.420
examples always quite short, and so I get these simple sentences here.

01:03:46.420 --> 01:03:55.960
So a search tree, or a Patricia tree, for that phrase could look or

01:03:55.960 --> 01:03:57.040
would look like this.

01:03:58.320 --> 01:04:03.220
Like we start at the initial position of our query term.

01:04:03.220 --> 01:04:10.980
If our query term starts with a P, we know that the only suffixes

01:04:10.980 --> 01:04:19.420
starting with a P are suffixes having the same characters at positions

01:04:19.420 --> 01:04:24.080
2, 3, 4, and 5, but at position 6 they might differ.

01:04:24.080 --> 01:04:31.400
So at position 6, there might be either a blank, or there might be an

01:04:31.400 --> 01:04:31.740
S.

01:04:32.960 --> 01:04:37.440
And so if there is a blank, then we are at position 13, namely here.

01:04:38.340 --> 01:04:43.520
And if there is an S, we are at position 19, which is there.

01:04:43.520 --> 01:04:49.240
So the first position would be Peter, the second position would be

01:04:49.240 --> 01:04:49.600
Peterson.

01:04:51.280 --> 01:04:57.140
And the same way, if the first letter would be a W, then you would

01:04:57.140 --> 01:05:02.080
have to actually check for the second letter of your query term.

01:05:02.080 --> 01:05:04.640
This could be E or O.

01:05:04.760 --> 01:05:09.980
If it is a different character from E and O, you certainly would just

01:05:09.980 --> 01:05:15.520
cancel your search and say, okay, this word doesn't occur in the

01:05:15.520 --> 01:05:16.800
document, we are done.

01:05:16.800 --> 01:05:22.640
But if it occurs, then it will have some of these, or you will be able

01:05:22.640 --> 01:05:24.020
to follow some of these branches.

01:05:25.060 --> 01:05:29.060
And here actually, if it starts with a W, you have to look at the

01:05:29.060 --> 01:05:34.340
first, second, and third character, and then you can determine the

01:05:34.340 --> 01:05:36.700
starting position of that word in the document.

01:05:36.700 --> 01:05:45.120
So in this way, we build up these search trees, or Patricia trees,

01:05:45.600 --> 01:05:48.480
corresponding to the suffixes of a text.

01:05:49.840 --> 01:05:54.220
And if we have such a search tree, we certainly want to use this

01:05:54.220 --> 01:05:58.560
search tree, this Patricia tree, for locating or finding out whether a

01:05:58.560 --> 01:06:00.200
certain word occurs within the document.

01:06:00.200 --> 01:06:02.380
That was the purpose of all this.

01:06:03.380 --> 01:06:09.460
And so, as I said, if you have any query term, you just scan through

01:06:09.460 --> 01:06:13.580
your data structure, compare your query term with those letters, and

01:06:13.580 --> 01:06:18.900
follow the appropriate paths, and then get to the final positions, to

01:06:18.900 --> 01:06:22.300
the leaves, where you have an indication that it is starting at

01:06:22.300 --> 01:06:24.920
position 5, 1, 10, or maybe 19.

01:06:53.500 --> 01:06:54.900
So, that's the way we use it.

01:06:54.900 --> 01:07:01.160
So, this is something like, here we just look at positions 1 and 6,

01:07:02.260 --> 01:07:08.460
and we would say if at position 1 we have a P, and at position 6 we

01:07:08.460 --> 01:07:18.720
have a blank, then this is a query starting at position 13, but we

01:07:18.720 --> 01:07:22.240
have not actually checked whether it is the word Peter.

01:07:22.240 --> 01:07:27.320
So, you would have to actually scan now through the document and check

01:07:27.320 --> 01:07:33.280
whether there is actually the word Peter, and not Potter, or Potter,

01:07:33.520 --> 01:07:35.100
or whatever.

01:07:35.100 --> 01:07:45.240
So, you actually have to compare or to look at all the words in the...

01:07:45.240 --> 01:07:49.680
you have to look at all the letters in your query term, but you only

01:07:49.680 --> 01:07:56.040
do that at positions where it is most likely that this pattern

01:07:56.040 --> 01:07:58.200
actually occurs at that position.

01:07:58.200 --> 01:08:04.200
So, first to determine the starting point for pattern matching, and

01:08:04.200 --> 01:08:12.320
then you compare search term and suffix, and then you can find out

01:08:12.320 --> 01:08:14.640
whether that search term actually occurred.

01:08:16.140 --> 01:08:18.120
That's the searching in a pet tree.

01:08:19.480 --> 01:08:21.260
Then, what about insertion?

01:08:21.380 --> 01:08:26.720
You want to insert another word into your search tree.

01:08:26.820 --> 01:08:31.480
That means you have a change in... you would say you have either

01:08:31.480 --> 01:08:35.760
another word which should be incorporated, or you have a change in

01:08:35.760 --> 01:08:36.340
your document.

01:08:36.340 --> 01:08:43.840
And what you actually do is you have to look at all the different

01:08:43.840 --> 01:08:48.140
suffixes, and if there's a new suffix because you have a new word in

01:08:48.140 --> 01:08:53.160
your document, you just have to rebuild that tree.

01:08:53.160 --> 01:08:56.680
And this is similar to the insertion into search trees.

01:08:57.100 --> 01:08:59.900
I will not go into examples for that.

01:09:00.000 --> 01:09:04.080
We could briefly look at the next slide.

01:09:04.940 --> 01:09:11.760
Here is the pet tree that results from this very artificial string,

01:09:12.120 --> 01:09:14.360
abb, abbb, abbc.

01:09:15.170 --> 01:09:17.660
So, very strange string.

01:09:18.160 --> 01:09:23.540
Here, actually, I would look at all the suffixes starting at these

01:09:23.540 --> 01:09:25.240
different positions.

01:09:25.560 --> 01:09:28.220
So here, this is just a sequence of letters.

01:09:29.020 --> 01:09:33.160
In a normal document, I said I would only look at suffixes starting at

01:09:33.160 --> 01:09:36.400
the starting positions of words.

01:09:36.400 --> 01:09:39.500
Here, it doesn't make sense to do it that way.

01:09:40.440 --> 01:09:46.260
Okay, so let's go back to our previous slide that was insertion.

01:09:46.260 --> 01:09:53.080
Now, we would like to find out something about the frequency of a

01:09:53.080 --> 01:09:54.580
certain word within the text.

01:09:55.180 --> 01:10:00.260
How can we find out that by looking at such patterns?

01:10:00.260 --> 01:10:10.160
So, the word... let's go back to the previous slide.

01:10:10.840 --> 01:10:18.680
Here, we know that there are two suffixes starting with the letter P.

01:10:19.720 --> 01:10:26.420
Because the number of leaves below that node here, that we get to if

01:10:26.420 --> 01:10:29.820
we have a suffix starting with a P, the number of leaves is two.

01:10:30.860 --> 01:10:34.100
So there are two suffixes starting with that prefix.

01:10:34.100 --> 01:10:42.620
That means the term that we have looked at so far occurs two times

01:10:42.620 --> 01:10:43.800
within our document.

01:10:45.080 --> 01:10:52.000
Or, there are one, two, three, four terms or suffixes starting with

01:10:52.000 --> 01:10:52.680
the letter W.

01:10:53.680 --> 01:11:00.880
So, we have scanned through some prefix of our search term and

01:11:00.880 --> 01:11:07.960
determined in a very easy way the number of occurrences of that term

01:11:07.960 --> 01:11:15.220
within our document just by counting the number of leaves in that

01:11:15.220 --> 01:11:15.720
subtree.

01:11:17.600 --> 01:11:24.880
And so, we have a simple way of determining the frequency of a word

01:11:24.880 --> 01:11:26.240
within a text.

01:11:26.620 --> 01:11:31.600
It's just the number of leaves in the corresponding subtree of the pet

01:11:31.600 --> 01:11:34.240
tree, as I just indicated.

01:11:35.120 --> 01:11:38.240
You could look for the most frequent word in the text.

01:11:39.080 --> 01:11:42.900
There, you would just search all the paths from the root to the next

01:11:42.900 --> 01:11:49.840
blank in your sequence and determine the largest subtree that you

01:11:49.840 --> 01:11:50.100
found.

01:11:50.240 --> 01:11:56.020
That would be the word with the highest frequency, the highest number

01:11:56.020 --> 01:11:56.620
of occurrences.

01:11:56.620 --> 01:12:05.040
You could perform a prefix search, determine all the words matching a

01:12:05.040 --> 01:12:07.260
given prefix, like your term.

01:12:07.820 --> 01:12:11.820
You just scan through a prefix of your search term and you immediately

01:12:11.820 --> 01:12:16.280
get all the suffixes matching a given prefix by just looking at all

01:12:16.280 --> 01:12:22.200
the positions indicated in the leaves of that subtree that you looked

01:12:22.200 --> 01:12:22.840
at so far.

01:12:23.540 --> 01:12:28.120
You could perform a range search, looking at all the words in between

01:12:28.120 --> 01:12:29.940
certain letters.

01:12:31.360 --> 01:12:35.140
And you could also look for the longest repetition search.

01:12:35.560 --> 01:12:39.920
That would be the longest path from the root to an inner node.

01:12:39.920 --> 01:12:57.420
You could also do regular expression search and things like that.

01:12:57.420 --> 01:12:57.420
Okay.

01:13:00.640 --> 01:13:05.320
This was, again, this string or this pet tree for this artificial

01:13:05.320 --> 01:13:06.140
string here.

01:13:07.220 --> 01:13:11.840
And now there are, if you'd like to actually build an index structure

01:13:11.840 --> 01:13:14.660
from that, you try to do it as efficiently as possible.

01:13:14.660 --> 01:13:23.460
You notice that maybe it's not that efficient to have one letter per

01:13:23.460 --> 01:13:29.360
node, so sometimes you just do some bucketing.

01:13:29.360 --> 01:13:35.780
So here you perform, or you build super nodes, build nodes that, like

01:13:35.780 --> 01:13:38.460
here, that is your tree.

01:13:38.980 --> 01:13:45.700
You could just maybe take a few of these nodes and just some subtree

01:13:45.700 --> 01:13:50.320
and collapse that into one super node, which is done in a B-tree.

01:13:50.320 --> 01:13:55.960
Maybe you could combine ideas from B-trees with these ideas in

01:13:55.960 --> 01:13:57.560
Patricia trees.

01:13:58.620 --> 01:14:05.120
Or you could actually do that in a very extreme way, and there you

01:14:05.120 --> 01:14:07.040
would have a pet array.

01:14:07.040 --> 01:14:13.940
So you would actually have an array corresponding to a pet tree.

01:14:15.060 --> 01:14:19.400
So the corresponding array would look like this, which is a

01:14:19.400 --> 01:14:26.040
lexicographical ordering of your words.

01:14:26.040 --> 01:14:29.420
So what could you do to search now?

01:14:30.100 --> 01:14:38.320
You would look for the first letter in your query term, then here in

01:14:38.320 --> 01:14:42.460
this, since this is a sorted data structure again, you could just

01:14:42.460 --> 01:14:48.140
perform binary search on your first character, get to a certain

01:14:48.140 --> 01:14:53.460
position, and then you have the links corresponding to these things.

01:14:53.460 --> 01:14:56.860
So here you could just perform this lexicographical search.

01:14:56.960 --> 01:15:02.580
Lexicographical search, as you know, means you look at a certain word

01:15:02.580 --> 01:15:07.280
and you scan through the symbols from left to right.

01:15:07.280 --> 01:15:15.800
And you just, if you do that, you can get to the appropriate positions

01:15:15.800 --> 01:15:24.040
here in your array, and that way find the appropriate starting

01:15:24.040 --> 01:15:26.200
positions of your suffix.

01:15:26.200 --> 01:15:34.500
This could be done in situations where you have quite static data

01:15:34.500 --> 01:15:37.560
structure or quite static document base.

01:15:38.120 --> 01:15:45.180
In a situation, as I mentioned, with frequent dynamic changes, you

01:15:45.180 --> 01:15:49.980
would not build up arrays because you have the problems with actually

01:15:49.980 --> 01:15:54.240
maintaining such a sorted list.

01:15:57.340 --> 01:15:59.140
Some remarks on that.

01:15:59.260 --> 01:16:01.440
Why do people come up with such data structures?

01:16:03.160 --> 01:16:11.720
I mentioned that Petri's have been designed because Mark Gournay was

01:16:11.720 --> 01:16:13.020
working on a larger project.

01:16:13.020 --> 01:16:22.520
There was a group of people getting the task to build up an index for

01:16:22.520 --> 01:16:24.180
the Oxford English Dictionary.

01:16:24.260 --> 01:16:28.260
The Oxford English Dictionary is quite a large document.

01:16:29.800 --> 01:16:33.340
It was 600 megabit of text.

01:16:33.340 --> 01:16:36.360
Well, that was 20 years ago.

01:16:36.460 --> 01:16:40.200
20 years ago, 600 megabit was a very, very large document.

01:16:40.360 --> 01:16:41.760
Today, it's not that large anymore.

01:16:42.860 --> 01:16:48.540
But 600 megabit of text was quite large, 20 years ago.

01:16:48.540 --> 01:16:56.320
There you had, like in the 80s, you had something like one megabit

01:16:56.320 --> 01:17:01.440
secondary memory or something like that, usually, or hard disk of that

01:17:01.440 --> 01:17:04.040
size, or you had one megabit.

01:17:05.400 --> 01:17:11.640
That was a revolution to have one megabit of main memory.

01:17:11.640 --> 01:17:13.340
You get 600 megabit.

01:17:13.420 --> 01:17:15.460
So it was out of question.

01:17:15.720 --> 01:17:21.020
It would never be possible to have the complete index or the complete

01:17:21.020 --> 01:17:23.020
document in main memory.

01:17:23.460 --> 01:17:25.820
Completely unconceivable.

01:17:26.380 --> 01:17:29.040
So you needed efficient data structures.

01:17:29.040 --> 01:17:35.420
So the main memory size was one or two or three megabit, but not more.

01:17:36.700 --> 01:17:43.280
And if you look at such a task, you look at the time that it actually

01:17:43.280 --> 01:17:50.360
needs to access a data structure of that size, you come up with a

01:17:50.360 --> 01:17:55.460
quick, you could say, over-the-thumb computation, which is an

01:17:55.460 --> 01:18:03.100
important thing if you try to get an estimate of the complexity of a

01:18:03.100 --> 01:18:03.920
certain task.

01:18:04.240 --> 01:18:06.900
You should always perform something like this.

01:18:06.900 --> 01:18:11.700
So here, a quick estimate of the time you need for accessing, for

01:18:11.700 --> 01:18:23.200
building up such an index, or what does it cost to insert something in

01:18:23.200 --> 01:18:24.860
that index.

01:18:24.860 --> 01:18:32.440
So, on the average, the insertion of a new entry into the index needs

01:18:32.440 --> 01:18:36.120
about log N random accesses to disk.

01:18:36.500 --> 01:18:38.820
Why do we come up with log N?

01:18:38.820 --> 01:18:40.800
We have looked at B-trees.

01:18:41.760 --> 01:18:49.340
There we had something like log N accesses to nodes on the way from

01:18:49.340 --> 01:18:55.820
the root to the leaves, and that means something like log N random

01:18:55.820 --> 01:19:01.120
accesses, with some constant factor that might reduce that.

01:19:01.120 --> 01:19:04.400
So, essentially, it's order of log N.

01:19:05.940 --> 01:19:10.720
So, every access to disk, which is a random access to disk at

01:19:10.720 --> 01:19:16.300
different positions, means we need time for that.

01:19:16.300 --> 01:19:26.060
And access time to the hard disk is something like, if you have a fast

01:19:26.060 --> 01:19:33.800
disk, something like 5 milliseconds, or maybe something like 5 to 10

01:19:33.800 --> 01:19:34.320
milliseconds.

01:19:34.960 --> 01:19:36.920
Maybe it even takes more.

01:19:37.000 --> 01:19:39.600
At that time, it was a bit more, but not that much more.

01:19:40.620 --> 01:19:48.480
And that means that you have something like 30 disk accesses per

01:19:48.480 --> 01:19:51.460
second to different blocks.

01:19:51.820 --> 01:19:57.720
Certainly, with every disk access, you get a large set of data.

01:19:57.720 --> 01:20:05.360
And if you would access a continuous sequence of pages from disks, you

01:20:05.360 --> 01:20:09.800
could get that information very quickly.

01:20:10.320 --> 01:20:15.240
But the individual access to hard disk takes some time because of

01:20:15.240 --> 01:20:21.340
these requirements of mechanical positioning of the disk head.

01:20:22.340 --> 01:20:31.140
Now, in this index that was built up, there was something like 120

01:20:31.140 --> 01:20:33.340
million index positions.

01:20:34.120 --> 01:20:47.080
If you want to build such an index, by this quite coarse estimate, you

01:20:47.080 --> 01:20:51.580
would need something like 30,000 hours or 3.5 years to just build this

01:20:51.580 --> 01:20:51.980
index.

01:20:53.300 --> 01:20:59.080
This may be too pessimistic, so even if there is just one disk access

01:20:59.080 --> 01:21:08.860
per entry, that means if we enter something new into our index, we

01:21:08.860 --> 01:21:10.380
just need one disk access.

01:21:11.020 --> 01:21:17.020
We still need 46 days of time to build up such an index.

01:21:17.940 --> 01:21:24.300
This was a computation that showed to those people who had this task

01:21:24.300 --> 01:21:28.940
here that they really had to think about efficient data structures.

01:21:28.940 --> 01:21:37.740
And so, they managed in the end to actually build up this index over a

01:21:37.740 --> 01:21:38.020
weekend.

01:21:40.460 --> 01:21:44.080
And that means they really had to do some clever things.

01:21:45.060 --> 01:21:50.820
They did something like they avoided frequent disk accesses, they

01:21:50.820 --> 01:21:58.240
constructed small indices in main memory, so a long list or collection

01:21:58.240 --> 01:21:59.240
of small indices.

01:21:59.240 --> 01:22:08.600
And these had to be merged then into larger lists, larger indices, and

01:22:08.600 --> 01:22:14.720
this was done in an efficient way, such that you had random access to

01:22:14.720 --> 01:22:19.900
these indices only when lists were in main memory, and you had

01:22:19.900 --> 01:22:24.980
sequential access whenever you accessed the secondary memory.

01:22:24.980 --> 01:22:29.640
And sequential access to memory can be done fast, because you can use

01:22:29.640 --> 01:22:35.040
the sequential positioning of blocks on the hard disk, if it's stored

01:22:35.040 --> 01:22:36.240
there in a reasonable way.

01:22:37.040 --> 01:22:40.780
And random access is fast in main memory, and is therefore restricted

01:22:40.780 --> 01:22:41.540
to main memory.

01:22:43.240 --> 01:22:49.960
And in that way, the data structure was efficient, and for that they

01:22:49.960 --> 01:22:51.640
designed these Patricia trees.

01:22:53.000 --> 01:22:57.220
Okay, more information is available in that book that I listed in the

01:22:57.220 --> 01:23:04.480
literature references list, and that's what I wanted to tell you

01:23:04.480 --> 01:23:04.960
today.

01:23:05.600 --> 01:23:11.700
We meet again next week, and I would like to ask you to hand in your

01:23:11.700 --> 01:23:12.480
assignments.

01:23:13.720 --> 01:23:17.020
Matthias Bonn is sitting there, he has put up these boxes.

01:23:17.400 --> 01:23:19.980
Please put in your assignments into these boxes.

01:23:20.740 --> 01:23:22.800
Thanks again for today's attention.

