A Data Structure
for Exact and Approximate Prefix Search
in Very Large Dictionaries
under Access Locality Requirements
and its Application to Morphological Analysis
and Spelling Correction*

Alexander Gelbukh

Center for Computing Research (CIC),
National Polytechnic Institute (IPN),
Av. Juan de Dios Bátiz s/n, Col. Zacatenco, CP 07738, DF, Mexico.

gelbukh@cic.ipn.mx

 


Abstract

A data structure useful for prefix search in a very large dictionary with an unlimited query string is discussed. This problem is important for morphological analysis of inflective languages, including particularly difficult cases such as German word concatenation or the Japanese writing system, which does not use spaces; similar tasks arise in DNA computing. The data structure is optimized for locality of access: to find all necessary records, access to only one block (page) of the main data storage is guaranteed, which significantly improves performance. To illustrate its usefulness, the algorithms of exact and approximate search are described, with application to morphological analysis and spelling correction. The algorithms for building, exporting, and updating the data structure are explained.

Keywords: prefix search, approximate prefix search, approximate string matching, morphological analysis, spelling correction, natural language processing, DNA computing.

1          Introduction

A typical database system is basically a device that can answer a simple question: Which records have the key x exactly equal to the given query string s? In some cases we are interested in another question: Which records have the key x that is a prefix of the given string s? By prefix, we mean an initial substring.

Definition 1. By x ⊑ s, we denote the fact that the string x is a prefix (initial substring) of the string s, i.e., that there exists a string y (possibly empty) such that s = xy. Here x is a finite-length string and s, y are finite or infinite strings.

Thus, given a query string s, finite or infinite, and the database D, we can distinguish the tasks of finding all records with the keys x:

Task 1. Exact search: {x ∈ D | x = s}; s is finite.

Task 2. Approximate search: {x ∈ D | x ≈ s} for some criterion of similarity ≈; s is finite.

Task 3. Prefix search: {x ∈ D | x ⊑ s}.

Task 4. Approximate prefix search: {x ∈ D | ∃ x′ ≈ x: x′ ⊑ s} for some criterion of similarity ≈.

In this paper, we are interested in the latter two cases, prefix and approximate prefix search, and specifically when the dictionary D is very large in terms of the number of records.

Note that unlike classical database search (Task 1), in approximate prefix search the length of the key x looked for is not known a priori. As we will see below, for a long enough query string s, even its length need not be specified in the query, i.e., s can be considered infinite. This task has some specificity and raises technical issues that the classical task of exact key matching does not face – specifically, problems related to data access locality.

In this article, we will discuss these issues and propose a data structure optimized with respect to them, along with the basic operations and sample applications. The structure proves to be a variation of a 2-level B-tree with slight redundancy.

In Section 2, we explain the motivation for the task of prefix search in very large dictionaries, formulate the technical problem of access locality faced with by the prefix search task, and discuss the related work. In Section 3, we explain the proposed data structure, step-by-step improving the naïve approach to the solution. In Section 4, we give a sketch of the algorithms for the basic operations with this data structure. Finally, in Section 5 we illustrate its usefulness for morphological analysis of natural languages, and specifically, for spelling correction, i.e., approximate string matching combined with decomposition of the string by a set of dictionaries.

2          The problem: Prefix Search in Very Large Dictionaries

Existing algorithms for prefix search, unlike those for exact search, require addressing different physical locations in the dictionary, which causes very inefficient behavior with block-oriented physical storage devices such as hard disks or virtual memory mechanisms.

Below we give more details on prefix search and the data access locality problem. Section 2.1 justifies the task of prefix search in very large dictionaries. Section 2.2 explains the technical problem arising in search in large dictionaries. Finally, Section 2.3 describes the related work found in the literature and some other approaches to the mentioned problems.

2.1         Prefix search in very large dictionaries: motivation for morphological analysis and DNA computing

One of the main sources of prefix search tasks is the analysis of strings formed as a concatenation, without any delimiter, of substrings possibly belonging to different dictionaries: s = s_1 … s_n, where s_i ∈ D or s_i ∈ D_i.

A typical example is the analysis of a DNA chain s as a concatenation g_1 … g_n of individual genes g_i ∈ D, where the number of genes |D| can be very large.

Our main motivation, however, was morphological analysis of words in highly inflective languages such as Spanish, French, or Russian. In natural language morphology, an example of the task is the following. Let us consider a dictionary (a database) of Spanish stems[1] like the one shown in Table 1.

Table 1. A fragment of the Spanish dictionary.

Key                Value
…                  …
clar-              adjective, claro
co-                prefix, co-
com-               verb, comer
con-               preposition, con
concentr-          verb, concentrar
const-             verb, constar
constancia-        noun, constancia
constante-         adjective, constante
constat-           verb, constatar
constelación-      noun, constelación
constipad-         noun, constipado, -a
constru-           verb, construir
construcción-      noun, construcción
constructiv-       adjective, constructivo
constructivismo-   noun, constructivismo
consult-           verb, consultar
…                  …

In fact, the dictionary is much denser (Diccionario, 1992); we have omitted some lines for brevity. Spanish words have highly inflected forms. For example, the stem[2] constipad- gives rise to the wordforms constipado, constipada, constipados, constipadas; the stem clar- to the wordforms claro, clara, claros, claras, clarísimo, clarísima, clarísimos, clarísimas, claramente; the stem constru- to about 700 wordforms such as construir, construyo, construido, constrúyanmelo[3], etc. As we see, finding the stem given a wordform is prefix search, Task 3 above.

This task is ambiguous, since we need to find all prefixes and not just the longest one. Indeed, here is an example where the longest prefix is not the right one. Suppose the Spanish dictionary contains such stems as ajen- for (derecho) ajeno, moren- for (color) moreno, escalen- for (triángulo) escaleno, etc. Then, for verb forms like (que ellos) ajen, moren, escalen (algo), the stems mentioned above are the longest ones, while the shorter stems aj-(ar), mor-(ar), escal-(ar) are the correct ones. Since making any linguistically meaningful decisions is not the business of the string search mechanism, we conclude that the dictionary search procedure should enumerate all the prefixes it finds, starting from the longest one, since longer stems usually have a higher probability of proving to be the right one.

Though examples of stem ambiguity are rather rare in Spanish, they are much more frequent in languages with more developed morphological inflection, such as Russian, Finnish, or Turkish, to mention just a few. What is more, in German, compound words are very frequent, so that one word can consist of many stems: for example, the first stem of kommunikationstechnik is kommunikation-. The extreme situation occurs, say, in the Japanese writing system, which simply does not use spaces between words,[4] much like the situation with DNA.

In such cases, when it is impossible to separate a single word before consulting the stem dictionary, the query string s used to look for its stem (prefix) x is actually the whole text, i.e., it can be thought of as unlimited in length.

On the other hand, there is another good reason to consider the whole unlimited text as the query string for searching the stems, even in languages that do mark word boundaries with spaces or other punctuation signs. In all such languages, word delimiters are sometimes overused. For example, the Spanish preposition a través de contains two spaces and thus looks like three words, though there are no technical reasons to treat it so. Thus, even in Spanish, as in Japanese or German, we cannot always easily detect the word boundaries![5] Fortunately, in order to use the search mechanism discussed in this paper, we really do not need to: it is enough to include the word a_través_de in the dictionary, with both its spaces, as one record key.

Thus, morphological analysis of natural languages requires prefix search in very large (stem) dictionaries.

2.2         The data locality problem and its motivation for morphological analysis

The second consideration contributing to the problem is of a purely technical nature: the way block-oriented storage devices process data access.

Data locality means access to a small region of data storage per operation. Suppose you need to retrieve the first name, last name, and age of a person from a file.

·       If all you need to do is to read line 1234, where these data are stored, then this operation is local.

·       On the other hand, if you need to read the name from line 678, the surname from line 901, and the age from line 2345, then this way of data retrieval is not local, since you have to address several different regions of data storage to fulfill a single operation.

With most data storage devices currently in use, local access is on average much faster than non-local access. The most obvious example is a tape, where the access time is proportional to the distance between the addressed locations. Disk storage, be it a hard disk or a CD, is less sensitive to data locality, but still sensitive: in this case, the access time is roughly proportional to the number of addressed locations, so that in the example above, the second variant is three times slower than the first one. This is because the data are exchanged with a disk storage device in blocks, or sectors, of fixed size. Reading, say, a kilobyte of sequential data takes nearly the same time as reading one byte, but repositioning the reading head to another location takes significantly more time.

It might be argued that the large amount of available memory renders the data locality problem unimportant, since the cost of addressing the random access memory (RAM) is proportional just to the number of retrieved bytes, regardless of their location. However, this is not completely true. Under the operating systems widely used nowadays, like Windows 98 or NT, the large (virtual) memory is much “less random access” than in the good old times of DOS.

Indeed, most of the physical memory is occupied by the active concurrent programs and the operating system itself, while random parts of any data loaded into memory are swapped out to the hard disk. Thus, under such systems, loading a very large dictionary into memory makes little difference from storing it on the disk. Simple experiments show that the data locality problem remains important even for a data structure that does “fit in memory” – i.e., has such a size that it could be stored in RAM if no operating system or concurrent processes existed.

On Intel processors, the quantum of memory – the so-called page – is four kilobytes. While sequential accesses within the same page can be considered local, access to a different page with a certain probability causes hard disk read and write operations because of virtual memory swapping.

Thus, for an algorithm to work fast, it should mostly address data within a four-kilobyte address window and avoid randomly accessing many addresses that differ by more than four kilobytes.

The data locality problem is especially important for natural language analysis applications, which use huge amounts of information in parallel for the analysis of each phrase. While each individual dictionary – a morphological dictionary, syntactic dictionaries such as subcategorization and lexical attraction (collocation) (Yuret, 1998) dictionaries, a semantic network dictionary (Cassidy, 2000; Fellbaum, 1998), world knowledge dictionaries (Lenat & Guha, 1990), etc. – might fit in the physical memory, all of them together will not. Meanwhile, the processing of each phrase requires the use of all of these resources in parallel, before the processing of the next phrase begins, and with tens of thousands of different words occurring in essentially random order. Thus, natural language processing tasks are unusual in their highly intensive memory use, which causes intensive virtual memory disk swapping. With this, no single large dictionary can be considered to be “completely stored in RAM.”

However, only morphological analysis faces the problem of data locality. All the other dictionaries – syntactic, semantic, and world knowledge ones – can be stored in a classical database with exact search – Task 1 in the terminology of Section 1 – for which the data locality problem can be easily avoided. With this, no more than one data storage access per word is necessary for each of the dictionaries except the morphological one. Thus, the data locality problem that morphological analysis faces is the bottleneck of data access during natural language analysis.[6]

Returning to the search algorithms, let us consider an example of a data locality-friendly one and an example of one that is not.

·         The classical hash table is data-local. To find the record with the key construcción, its hash value is calculated, which defines the address starting from which (at this address or at some nearby one in case of collision) the record is located. However, a hash table is good for exact search – Task 1 – but it does not work for prefix search – Task 3 – for which a potentially infinite number of hypotheses would have to be tried.

·         Binary search in an alphabetically ordered list works fine for the prefix search task. The main search operation in this case is that of finding the alphabetic place of a given string in the list.[7] However, this structure is not local: for a binary search in, say, 1000 items, one needs to address 10 different locations in the list, some of which are quite far from each other.

With binary search, the problem is not only a wrong algorithm. The prefix search task for an alphabetically ordered list is inherently non-local, at least if all possible prefixes are to be found. Indeed, let us consider the search procedure for the Spanish verb form consto. Its alphabetic place in the dictionary in Table 1 is after constelación-, while the true answer is const-; what is more, since the prefix search procedure should find all possible prefixes, the records con-, co-, and c- are also to be retrieved. In (Diccionario, 1992), the distances between these locations are as follows: c- … (9327 words) … co- … (2101 words) … con- … (1445 words) … const- … (95 words) … constituyentes- (the alphabetic place of consto). Since these locations are the results of the query and thus any algorithm has to address them, no algorithm based on this data structure can be local.

In this article, we will discuss a data locality-friendly modification of alphabetic search. Namely, at the cost of slight redundancy, our data structure guarantees that after retrieval of exactly one block (page) of data from the main storage, all the records with keys that are initial substrings of the given string are found.

2.3         Related work

There is vast literature on prefix-matching data structures, such as tries, B-trees, Prefix B-trees, B+-trees, extendible hashing, etc. (Aho, 1990; Bayer & Unterauer, 1977; Comer, 1979; Gusfield, 1997). The theory of such structures has been developed in the context of database index management rather than of dictionary applications; thus, these structures and the corresponding algorithms are optimized with respect to efficient updating (insertions and deletions) (Johnson & Shasha, 1993), which is rather irrelevant for our case. On the other hand, these structures are not optimized for access locality (see the next section), which is the main topic of our paper. Actually, what we propose is similar to a 2-level B-tree (which we then extend to an n-level B-tree) with the additional feature of repetition of some keys, which optimizes the structure with respect to data access locality.

In our paper, we will illustrate the usefulness of the suggested structure for spelling correction. The literature on approximate string matching is also huge (Aho, 1990; Frakes & Baeza-Yates, 1992; Gusfield, 1997). We will consider a kind of nearest-edit-distance spelling correction, with a set of editing operations specific to typographic errors in texts. However, our discussion of spelling correction is not aimed at proposing a new approximate string-matching algorithm per se. Instead, we will describe a fast spelling correction algorithm that works with the same data structure that is used for exact string matching within an integrated morphological analysis system, without any modifications of the structure that would slow down the normal (exact) operation or require additional storage space.

Data locality is a well-known issue in programming and computer science (Knuth, 1998). Recently there has not been much discussion of this topic, since, as the amount of available RAM increases, the problem loses its importance for many applications. As we have shown, however, this is not the case for the morphological analysis module of a natural language analysis system.

In computational linguistics, there are two major approaches to morphological analysis: one based on morphonological transformations and another on simple dictionary lookup. The first approach is exemplified by two-level morphology (Koskenniemi, 1983). With such methods, the program first guesses what the stem of the wordform could be, and then checks each hypothesis against the dictionary. E.g., for wordforms like ladies and caries, the hypotheses lady- and *cary- would be tried, respectively. Thus, no prefix search is required. However, the method is not data-local due to multiple hypotheses: for the example above, the hypothetical stems *ladies- and caries-, as well as many others, are to be looked up in the dictionary, too.

The other, currently more widely used, approach to morphological analysis was called the allomorph method by Hausser (1999). With this method, the wordform is considered as a concatenation of substrings (allomorphs) listed in separate dictionaries, as we discussed in Section 2.1. There are two variations of this method, which can be conventionally called right-to-left and left-to-right analysis (Gelbukh, 1995). With the first method, a word is separated from the letter string, then all possible affixes are determined from the small lists of affixes, and then each hypothetical stem is looked up in the dictionary. This leads to the same problems with multiple hypotheses as two-level morphology discussed above.

Finally, left-to-right analysis uses the whole word, or even the whole text from the current analysis point on, as the query string s for the dictionary lookup procedure, which uses prefix search to find all hypothetical stems in one search operation. We have discussed the advantages of this method in Section 2.1. In this work, we show that such a search can be made in a data locality-friendly way.

Thus, the main antecedents of our work are the left-to-right allomorph method of morphological analysis (Gelbukh, 1995; Hausser, 1999) and the techniques of prefix search in B-trees. What is new in our task is the consideration of the data locality problem with respect to prefix search, and what is new in our solution is the repetition of some records in the data structure, which guarantees one storage block access per word, as explained in Section 3.2.

3          Data structure

Hereafter we will usually speak interchangeably of the keys and of the records that have these keys when this does not create any confusion. For example, we will write “list of keys,” implicitly supposing that attached to each key in the list is some information (value), the structure of which is not important for our discussion.

We will start from a naïve approach and then describe two improvements to it. Namely, let us start from an alphabetically ordered list of keys D, which appears to be a quite natural data structure for the prefix search task. As we have seen in Section 2.2, there are two locality problems with this list.

First, even if we are interested in finding only the alphabetic place of a string s in D, binary search is not a local algorithm.

Second, if we are interested in finding the initial substrings of s, especially all of them rather than only one, then non-locality is an inherent property of such a data structure, i.e., it does not depend on the algorithm used.

In this section, we will consider these two problems and suggest the corresponding solutions. In Section 3.1, we will describe a simple solution to the first problem, which appears to be a kind of 2-level B-tree. In Section 3.2, we will describe a solution to the second problem, which is our novel contribution.

3.1         Algorithm locality: index of blocks

By a ≤ b we will denote the alphabetic (lexicographic) order on letter strings.

Definition 2. The alphabetic place [s] of a string s in D is the number of records r ∈ D such that r ≤ s, i.e., the position of the key immediately before the place that s would have if it were inserted into D.[8]

In an appropriate context, we will understand by [s] the corresponding record itself rather than its number; following the C++ notation, we denote this record by D[s]. This should not cause confusion, since it is always clear whether we mean a number or a record. Note that D[s] = s if and only if s ∈ D.

Let us split D into segments, or blocks, B_1, …, B_m, corresponding to the physical storage units such as disk sectors, memory pages, network packages, etc.: D = B_1 ∪ … ∪ B_m. Such splitting can be done sequentially, from the first to the last record: we add each record to the current block; if adding a record would exceed the block size, we leave some space in the current block unused, form the next block, and make the current record the first one in it. The system {B_i} can be formally defined as the only system such that:

1.        Each B_i is a continuous segment of D, which we will denote, by analogy with geometry, as [a, b] ≡ {r ∈ D | a ≤ r ≤ b},

2.        ∪_i B_i = D,

3.        B_i ∩ B_j = ∅ for i ≠ j,

4.        The total size of all records in each B_i does not exceed the required size of the block, and

5.        No B_i can be extended to the right without violating this restriction on the size.

The uniqueness of such a system is obvious by induction on the block number: the first block is defined uniquely, then the second, etc.

Let us consider the task of finding the alphabetic place [s] of a given string s. Binary search is not a local algorithm. However, it can be easily improved using an index of the blocks.

Let us denote the first key of the block B_i by b_i and its last key by e_i, so that the block is a segment B_i = [b_i, e_i] ⊆ D.

Let us consider the set of keys I = {b_1, …, b_m}, where b_i is the first key in B_i. We call this set the (first-level) index of the blocks.

Theorem 1. The alphabetic place of a string s in D is located in the block whose number is the alphabetic place of s in I: D[s] ∈ B_{I[s]}.

Proof: By construction, b_{I[s]} ≤ s < b_{I[s]+1}, so that b_{I[s]} ≤ D[s] ≤ e_{I[s]}. □

Thus, the necessary block can be found by means of a search in the index I. We consider I small enough for the data locality problem not to apply to it. What is more, it can be made even smaller by removing from some keys redundant letters that do not change the alphabetic place of any string. Namely, let us replace each element b_i by its minimal initial substring b̃_i that is not an initial substring of the last key e_{i−1} of the previous block B_{i−1}: b̃_i ⊑ b_i, b̃_i ⋢ e_{i−1}, and b̃_i is minimal with this property (see Definition 1 for the symbol ⊑). Let Ĩ = {b̃_1, …, b̃_m}.

Lemma 1. For any strings x, y, z, the following conditions are incompatible: x ≤ y, y ≤ z, x ⊑ z, and x ⋢ y.

Proof: Suppose all four conditions hold. Since x ≤ y and x ⋢ y, there is a position k ≤ |x| at which y first differs from x, and x(k) < y(k). Since x ⊑ z, we have z(1, k − 1) = x(1, k − 1) = y(1, k − 1) and z(k) = x(k) < y(k); then, by definition of the lexicographic order, z < y, which contradicts y ≤ z. □

Theorem 2. The alphabetic place of a string s in I is the same as in Ĩ: I[s] = Ĩ[s].

Proof: Let i = I[s], j = Ĩ[s]. First, since b̃_i ⊑ b_i and b_i ≤ s, then b̃_i ≤ s, so that j ≥ i. Second, since e_i ≤ b_{i+1}, b̃_{i+1} ⊑ b_{i+1}, and b̃_{i+1} ⋢ e_i, then, by Lemma 1, it is impossible that b̃_{i+1} ≤ e_i. Thus, s < b̃_{i+1}, which means that i = Ĩ[s] = j. □

Thus, we can use the even smaller index Ĩ to find the necessary block. First, the place of s is found in Ĩ, then the block B_{Ĩ[s]} with the corresponding number is retrieved, and then s is found in B_{Ĩ[s]} by any suitable method. Ignoring the problem of locality of the search in the small list Ĩ, we can say that this algorithm of finding the alphabetic place of s in D is local.

In our practical implementation, for speed and simplicity, we implemented the search for s in Ĩ by using a second-level index: we split Ĩ into blocks and compiled an index Ĩ² of their first keys. We applied this procedure recursively, splitting that second-level index into blocks and compiling a third-level index Ĩ³, which in our case consisted of only one block and thus did not need any index of its own.
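To make the index search concrete, here is a minimal sketch in Python (an illustration, not the implementation used in the system described here). It assumes that the index of first keys fits in memory as a sorted list and that each block is itself a sorted list of keys:

```python
from bisect import bisect_right

def alphabetic_place(keys, s):
    """[s]: the number of keys r in the sorted list `keys` with r <= s
    (Definition 2)."""
    return bisect_right(keys, s)

def find_block(index, blocks, s):
    """Theorem 1: the alphabetic place of s in D lies in the block whose
    number is the alphabetic place of s in the index of first keys
    (possibly truncated, as justified by Theorem 2)."""
    i = alphabetic_place(index, s) - 1   # 0-based block number
    return blocks[max(i, 0)]             # clamp for s below the first key
```

For a multi-level index, find_block is simply applied level by level, each level treating the next lower-level index as its set of blocks.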

3.2         Data locality: block prefix

Though, as we have shown in Section 3.1, the problem of finding D[s] is local, the problem of finding a key x ∈ D, x ⊑ s, is not. As discussed in Section 2.2, first, x can be located far from D[s]; second, if we are interested in finding all such keys, they can be located quite far from each other, as we have seen with the keys co-, con-, const-, etc. Such keys are probably located in different blocks B_i.

What we propose is to add to each block all the keys that can be necessary for any query addressing this block. As we will show, the changes result only in appending a small amount of information at the beginning of the block. We call this additional piece of information a block prefix.

Namely, for any s such that D[s] ∈ B and any x ∈ D such that x ⊑ s, we propose to duplicate the key x, with its corresponding record, into the block B. Probably some records of D will be moved to other blocks, since the size of the block is fixed; thus this operation can change the way D is split into blocks and increase the number of blocks. However, with this operation we reach our goal: by retrieval of only one block B, namely, the one such that D[s] ∈ B, we find in it, locally, all keys x ∈ D such that x ⊑ s.

Does this cause a significant increase of the dictionary size? Does this result in complicated algorithms for searching and for building the dictionary? Let us examine closer the set of keys that are to be added to B and the way D is now split into blocks.

Definition 3. A set of strings S ⊆ D is closed if ∀x ∈ S and ∀y ∈ D, if y ⊑ x then y ∈ S.

Definition 4. The closure S̄ of a set S ⊆ D is the minimal closed subset of D that contains S.

Lemma 2. Definition 4 is correct, i.e., there exists exactly one minimal closed subset of D that contains S, namely, S̄ = {y ∈ D | ∃x ∈ S such that y ⊑ x}.

Proof: Obviously, this set is closed and contains S, since for any x, x ⊑ x. On the other hand, it is contained in any other set with these two properties. □

Now we can formalize our solution. Let us split D into a set of segments B_i (which are not necessarily the same B_i that we discussed before), such that each closure B̄_i, rather than the block B_i itself, has the size corresponding to the physical storage unit. All the statements of the previous sections hold for this segmentation of D.

Let this segmentation of D be minimal, i.e., let it split D into the minimal possible number of blocks. We propose to use the set of closures {B̄_i} as our main data structure (instead of the original set of blocks {B_i}). Now we will show that this structure is only insignificantly larger than the original set D: Σ_i |B̄_i| ≈ |D|.

Lemma 3. For any s and any x ∈ D, if x ⊑ s, then x ⊑ D[s].

Proof: By definition of [s], if x ∈ D and x ≤ s, then x ≤ D[s]. If x = D[s], then x ⊑ D[s]; else, since x ≤ D[s], D[s] ≤ s, and x ⊑ s, then, by Lemma 1, x ⊑ D[s]. □

Lemma 4. For any block B ⊆ D and any x ∈ D, the following conditions are equivalent:

·          ∃s such that D[s] ∈ B and x ⊑ s,

·          ∃a ∈ B such that x ⊑ a.

Proof: By Lemma 3. □

Thus, the set of keys to be added to a block B is exactly the set of all initial substrings x ∈ D, x ∉ B, of the keys a ∈ B. What is more, the keys actually to be added are just the initial substrings of the very first key b of B, since the other ones are already in B. By b < a we denote the strict alphabetic order.

Lemma 5. Let b be the first key of a block B ⊆ D, x ∈ D, x ∉ B. Then if ∃a ∈ B such that x ⊑ a, then x ⊑ b.

Proof: Since x ∈ D, x ≤ a (because x ⊑ a), and x ∉ B, then x < b. Since x < b, b ≤ a, and x ⊑ a, then, by Lemma 1, x ⊑ b. □

Thus, finally, the only keys to be duplicated into the block in order to guarantee access locality are the initial substrings of the very first record of the block:

Theorem 3. For any segment S = [b, e] ⊆ D, S̄ = S ∪ {x ∈ D | x ⊑ b, x ≠ b}.

Proof: By Lemma 5. □

Thus, each closed block B̄_i is obtained from the corresponding block B_i by adding, before its first record, all records of D whose keys are initial substrings of only one key, namely the first key of B_i. Clearly, there are not many such records in D.

For example, for a block of the Spanish dictionary starting with the stem constat- for constatar, only the records with the keys const-, con-, and co- are duplicated into the block, no matter how large the block is.

We carried out experiments with the AMORE morphological analysis and synthesis system we have developed (Gelbukh, 1995). In the dictionary, together with each stem, morphological information of about 20 bytes on average was stored. The blocks were of 1 kilobyte and contained about 30 records each on average.

In our experiments, about 2 to 3 records on average were added to each block, so the redundancy was about 10%; it would be proportionally less if we used larger blocks of, say, 4 kilobytes.

With such a small redundancy, we guarantee complete locality of data access: any initial substring query is processed with the retrieval of exactly one block of the main data storage.
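The construction of a closed block, as characterized by Theorem 3, can be sketched in Python as follows (illustrative only; record values are omitted, and the dictionary keys are assumed to be available as a set for membership tests):

```python
def closure(block, all_keys):
    """Theorem 3: the closure of a segment of D is the segment itself plus
    all keys of D that are proper initial substrings of its first key."""
    b = block[0]
    # prefixes of b, shortest first; this is also their alphabetic order
    prefix = [b[:k] for k in range(1, len(b)) if b[:k] in all_keys]
    return prefix + block
```

For a block starting with the stem constat- (the hyphens are not actually stored, see footnote 1), this prepends exactly the records co-, con-, and const- of Table 1, as in the example above.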

4          Operations: searching, building, and exporting

To prove the practical usefulness of the proposed data structure, we provide the algorithms for building, exporting, and searching it. We do not provide, however, any fast algorithms for insertion and deletion, since these operations are rather irrelevant for linguistic dictionaries, which are not updated very frequently, though they are relevant for databases (Johnson & Shasha, 1993).

4.1         Search

We have already discussed the idea of the search algorithm in the previous sections; here we formalize it. Given a string s and a dictionary (database) D, our task is to find all x ∈ D such that x ⊑ s (Task 3 in Section 1).

Suppose we have an algorithm A to find the alphabetic place [s] in a given block; see Section 4.1.1 below. Then, we start from the highest-level index (Ĩ³ in our implementation). With A, we find the number of the necessary block of the next lower-level index and retrieve that block. Then we repeat the search in this block, extract the corresponding block of the next level, etc., until the first-level index Ĩ is reached, and finally retrieve from the main storage the block B̄ that contains D[s] and thus all x ∈ D such that x ⊑ s.

The description of the algorithm A is not essential for the present article. It does not affect the issue of locality since all its memory accesses are within one block (physical memory page). However, we will discuss it briefly because of its importance for the practical implementation of a system based on the proposed data structure. The algorithm depends on the internal representation and possible compression of data in the block.

Sections 4.1.1 and 4.1.2 are not essential for understanding the main topic of the article.

4.1.1           Search in a compressed block

Data compression is important for our algorithms because of the fixed size of the block. Indeed, as we have seen in Section 3.2, the additional overhead of each block – the block prefix – does not depend on the block size. Thus, the percentage of redundancy depends on the size of the block. If the size of the block prefix were comparable with the total block size, our algorithms would be rendered unaffordable.

In our blocks, the data are ordered alphabetically, so that keys with a common initial substring are located together. To compress such data, Cooper compression works well (Cooper, 1958): instead of the initial substring equal to that of the immediately preceding key, only the number of common letters is indicated. For example, the keys in Table 1 are compressed in the following way: 0/clar-, 1/o-, 2/m-, 2/n-, 3/centr-, 3/st-, 5/ancia-, 7/te-, 6/t-, 5/elación-, 5/ipad-, 5/ru-, 7/cción-, 8/tiv-, 11/ismo-, 4/ult-. The first key in the block always has the Cooper coefficient 0.
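The compression step itself is straightforward; here is a sketch in Python (illustrative only):

```python
def cooper_compress(keys):
    """Front compression (Cooper, 1958): each key is replaced by the number
    of leading letters it shares with the previous key, plus its tail."""
    out, prev = [], ""
    for key in keys:
        c = 0
        while c < min(len(prev), len(key)) and prev[c] == key[c]:
            c += 1
        out.append((c, key[c:]))  # e.g., "constante-" after "constancia-" -> (7, "te-")
        prev = key
    return out
```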

Such compression is especially useful for morphological dictionaries, since in many languages there are variants of stems of the same word differing only in the last few letters. English examples are lady – ladi-es, stop – stopp-ed; Spanish: traduc-ir – traduj-o – traduzc-a, indic-ar – indiqu-en, alcanz-ar – alcanc-en; Russian: prevozmo-ch’ – prevozmog-u – prevozmozh-esh’[9] (Zaliznyak, 1987).

Here we will not give any detailed comments on the search algorithm, but only a short description; we again ignore the complications caused by records with equal keys. The central variable of the algorithm is the common initial substring length e: the length of the maximal common initial substring of the query string s and the current key being processed.

The keys are examined one by one; initially, e = 0. Let the Cooper coefficient of the current key be c. If c > e, the algorithm just proceeds to the next key. Otherwise, starting from the position c, the string s is compared with the current key, and the new value of e is determined as the first position at which the two strings differ. During this comparison, two special situations can occur. If s is alphabetically less than the current key, or the block is exhausted, the search process is over. If the current key is exhausted, i.e., it is an initial substring of s, one of the keys we are looking for is found. It is added to the set of the search results. In our implementation, the search results are stored in a LIFO stack, and each found key is pushed onto this stack. After the search is over, the results can be popped out of the stack one by one and examined; with this, they are examined in the order natural for most applications, i.e., from the longest to the shortest.

Since the block B̄ contains the block prefix with the duplicated records, all the necessary records are found in this single block.
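The scan just described can be sketched in Python as follows (assuming a block is a list of (coefficient, tail) pairs as produced by cooper_compress above; an illustration under these assumptions, not the system's actual code):

```python
def search_compressed_block(block, s):
    """Collect all keys of a Cooper-compressed block that are initial
    substrings of the query s; the result is a stack, so popping yields
    the keys from the longest to the shortest."""
    results = []              # LIFO stack of found keys
    e = 0                     # common prefix length of s and the current key
    prev = ""
    for c, tail in block:
        key = prev[:c] + tail # decompress the current key
        prev = key
        if c > e:             # key agrees with prev beyond where s does: skip
            continue
        e = c                 # compare s with key starting from position c
        while e < len(key) and e < len(s) and key[e] == s[e]:
            e += 1
        if e == len(key):     # the whole key matched: key is a prefix of s
            results.append(key)
        elif e == len(s) or key[e] > s[e]:
            break             # s < key: the alphabetic place of s is passed
    return results
```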

4.1.2           Substring chains

The search method described above looks through all the records in the block until the alphabetic place of the query string s is found. However, there are algorithms that jump directly to this position D[s], for example, binary search. By storing some additional information in the dictionary, we can find all the initial substrings of s starting directly from D[s]. For this, let us suppose that with each record x ∈ D, a pointer to, or just the length of, the maximal y ∈ D such that y ⊑ x, y ≠ x, is kept in the dictionary. Such pointers link each key into a chain of its maximal initial substrings.

Lemma 6. For any s and any x ∈ D, if x ⊑ s, then x belongs to the chain starting from D[s].

Proof: By Lemma 3. □

Thus, to find all the initial substrings of s, the chain is to be looked through starting from D[s] until an element x is found such that x ⊑ s. Then the chain starting from x is the result of the query. For most applications, this chain is to be examined starting from x, i.e., from the longest substring to the shortest.
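A sketch of this variant in Python, with the chain represented as a hypothetical `parent` map from each key to its maximal proper initial substring in D (or None if there is none):

```python
from bisect import bisect_right

def prefixes_via_chain(D, parent, s):
    """Find all keys of the sorted list D that are initial substrings of s
    by walking the chain of maximal initial substrings from D[s]."""
    i = bisect_right(D, s) - 1       # index of D[s]
    if i < 0:
        return []
    x = D[i]
    while x is not None and not s.startswith(x):
        x = parent[x]                # skip chain elements not prefixes of s
    chain = []
    while x is not None:             # by Lemma 6, this tail is the whole answer
        chain.append(x)              # longest first
        x = parent[x]
    return chain
```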

4.2         Building

In spite of the simplicity of the formula for the duplicated records, its application looks like a vicious circle. Indeed, the duplicated records depend on the first records of the blocks, while the block layout itself depends on such duplication, since now it is the size of B̄_i that is fixed, not that of B_i. However, we propose a simple algorithm for building such a data structure given an alphabetically ordered list of records D. Note that the algorithm builds the complete structure rather than updating individual elements in it.

Our task is to build the structure in the block format discussed above, i.e., to split D into a minimal set of blocks B_i such that D = ∪_i B_i and, for each block, the size of its closure |B̄_i| ≤ C, where C is a constant, say, 4 kilobytes.

Let us first consider an auxiliary data structure that we call a stack of initial substrings. This is not exactly a LIFO stack, but it is similar to one. The stack of initial substrings S has an underlying LIFO stack S̃ of records of D, and implements one operation, a kind of push, with the following behavior. When a record with the key s is pushed onto S, records are popped out of S̃ until a record with a key x ⊑ s is found on the top of S̃ or S̃ is empty. Then, the new record is pushed onto S̃.

Let the records of the ordered list D be read one by one and pushed onto S. At the beginning, S is empty.

Theorem 4.      At each step of this process, after a record with the key s is processed, S is ordered alphabetically from bottom to top and contains exactly all records whose keys are initial substrings of s: S = {x ∈ D | x ⊑ s}.

Proof: First, since the new records are pushed onto S̃ (after some pop operations), for each record, only the records that came before it are deeper than it in S̃. Since the records come in alphabetical order, S̃ is ordered alphabetically. Second, when a record is pushed onto S̃, the key immediately below it is its initial substring; thus, S̃ contains a chain of initial substrings; in particular, x ∈ S̃ ⇒ x ⊑ s. On the other hand, if x ⊑ s and x ∈ D, then x ≤ s, so that x came at some step and was pushed onto S̃. Moreover, at no step was it popped out of S̃. Indeed, if it were popped out by some record y that came after x, then x ≤ y, x ⋢ y, y ≤ s, and x ⊑ s, which is impossible by Lemma 1. □
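In Python, the stack of initial substrings can be sketched as follows (keys only; in the real structure, whole records are stored):

```python
class PrefixStack:
    """Stack of initial substrings (Section 4.2): after the keys of D are
    pushed in alphabetic order, the stack contains exactly the keys that
    are initial substrings of the last pushed key (Theorem 4)."""
    def __init__(self):
        self.items = []

    def push(self, key):
        # pop until a key that is an initial substring of `key` is on top
        while self.items and not key.startswith(self.items[-1]):
            self.items.pop()
        self.items.append(key)
```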

The algorithm of formation of the dictionary works as follows.

·         At the beginning of the work, an empty block B̄_1 is created.

·         During the work cycle, the records of D are read one by one and pushed onto S. After a record with the key s is read, an attempt is made to append it to the block B̄_i being currently formed. If there is any block compression, as discussed in Section 4.1.1, it is performed. The new potential size of the block is estimated. If still |B̄_i| ≤ C, then the process repeats with the next record of D.

·         However, if with the new record the size |B̄_i| would exceed C, then the block B̄_i, without the new record s, is ready. It is appended to the dictionary structure being formed, and a new empty block B̄_{i+1} is created. The contents of the stack S are copied to B̄_{i+1} from bottom to top, thus making the block closed. With this, s, which is currently on the top of S, becomes the first record of the new “logical” (not closed) block B_{i+1}, while B̄_{i+1} contains all the records whose keys are initial substrings of s and is ordered alphabetically. After that, the process repeats with the next record of D.

In our implementation, at the moment of formation of the new block B̄_{i+1}, we dump the key b_{i+1} of its first record, to compile later the index Ĩ. The key is truncated according to the definition of b̃_{i+1} (see Theorem 2), since the last key e_i of the previous block is known at the moment of formation of B̄_{i+1}.

Also, with each block we kept a header containing only two numbers: (1) the total number of the records in the block and (2) the size of the block prefix, i.e., the number of the redundant records duplicated into the block, which is the depth of S minus one at the moment of creation of the new block B̄_{i+1}.

We did not mention here the special treatment of the first and last blocks. Another complication is the special case of records with equal keys. This case can be ignored if we prohibit equal keys in D, merging the values of such records when necessary. Alternatively, special precautions are to be taken for a group of records with equal keys not to be split across a block boundary.

The described algorithm is single-pass.
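Putting the pieces together, the single-pass construction can be sketched as follows (using the PrefixStack above; `sizes` is a hypothetical map from a key to the stored size of its record, and C is assumed large enough to hold any single record together with all its prefixes):

```python
def build_blocks(D, sizes, C):
    """Split the sorted key list D into blocks whose closures fit into C
    bytes (Section 4.2); returns the list of closed blocks."""
    blocks, stack = [], PrefixStack()
    current, current_size = [], 0
    for key in D:
        stack.push(key)
        if current and current_size + sizes[key] > C:
            blocks.append(current)       # the block is ready, without `key`
            # new closed block: prefixes of `key` (the block prefix), then `key`
            current = list(stack.items)
            current_size = sum(sizes[x] for x in current)
        else:
            current.append(key)          # append `key` to the current block
            current_size += sizes[key]
    if current:
        blocks.append(current)
    return blocks
```

The number of redundant records of each block, stored in its header, is the stack depth minus one at the moment the block is created.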

4.3         Exporting and updating

Now let us consider the inverse task: given the block structure {B̄_i}, restore the ordered list of records D. This task is rather trivial. In our implementation, each block B̄_i contains a header with the number k_i of the redundant records in it, so that it is enough to dump out the contents of each block, from the first to the last record, ignoring the first k_i records of each block. Even if the number k_i were not kept with the block, it could easily be restored by looking for the first record a ∈ B̄_i such that a > e_{i−1}, where e_{i−1} is the last record of B̄_{i−1}. This procedure is also a one-pass algorithm.
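A sketch of the export in Python, given the per-block counts of redundant records from the headers (hypothetical names, same assumptions as the building sketch above):

```python
def export_blocks(blocks, prefix_counts):
    """Restore the ordered record list D from the closed blocks by skipping
    the duplicated block-prefix records (Section 4.3)."""
    D = []
    for block, k in zip(blocks, prefix_counts):
        D.extend(block[k:])   # ignore the first k records: they are duplicates
    return D
```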

As we have pointed out, in this paper we do not discuss any fast algorithm for updating the block dictionary structure by adding, deleting, or changing a few records. In the context of database index management, there is much literature on updating similar structures, such as B-trees and their variants (Gusfield, 1997). However, unlike database indices, morphological dictionaries do not change frequently, so we do not need to develop efficient algorithms for updating the block structure.

The simplest way of updating the structure is to export it into a sorted list D, then merge in, remove, or change some records, and then create a new block structure out of the updated list. Since both the exporting and building algorithms are fast and one-pass, this way can be quite affordable.

There is a practical alternative to complete restructuring of the block dictionary in the case of minor changes. The idea is to leave some free space in the blocks while creating the structure. Then adding new records will result in reconstructing only a limited number of blocks. The possible need to duplicate the new records into some blocks complicates things, though in the case of minor changes the number of changed blocks still remains much smaller than the total number of blocks in the dictionary. Here we do not discuss this algorithm; see (Gelbukh, 1995).

5          Applications: morphological decomposition and spelling correction. Approximate prefix search

Decomposition and error correction – or approximate matching – tasks arise in nearly any search application of the type we consider. However, our main motivation here is again natural language analysis.

In particular, our algorithm of approximate prefix search will be described in tight integration with the decomposition algorithm. This gives a solution to the task of approximate decomposition, i.e., approximate matching with a string that can be decomposed by the given set of dictionaries.

The role of this section is mostly illustrative since it shows how our structure can be used for typical tasks of string processing. Because of the marginal role of this section and relative complexity of the algorithms, the description of the latter will be quite sketchy.

5.1         Decomposition of a string

Let us consider several string lists D_1, …, D_n. By decomposition of a given string s, we mean finding a set of substrings x_i ∈ D_i such that their concatenation is s: s = x_1 … x_n. If this can be done in several different ways, all such sets are to be found. Such decomposition is the main operation in morphological analysis of natural languages: Spanish habl-ába-mos, German kommunikation-s-technik. The specifics of this task for natural languages is that at least one (and usually exactly one) list of substrings D_i is very large, namely, the stem dictionary.

We view the decomposition task as a sequential application of the substring search algorithm discussed above. First, the initial substrings of s are found in D_1. For each of them, the initial substrings of the rest of s are looked up in D_2, etc. After the last dictionary is processed, only the variants of analysis covering the entire string s are accepted. Actually, in natural language analysis, the latter condition is slightly more complicated: the variants of analysis are accepted for which, after the last piece, a symbol follows in the text that can be a word delimiter, such as a comma or a period.

This method – analyzing the rest of the string s for each initial substring found in D_1, …, D_{i−1} – effectively implements a kind of backtracking, sketched below.
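The backtracking can be sketched in Python as follows (a naive in-memory illustration: each dictionary is just a set of strings, prefix_search stands in for the block-based search of Section 4.1, and the acceptance test on word delimiters is omitted):

```python
def prefix_search(D, s):
    """All elements of the set D that are initial substrings of s, longest
    first (a stand-in for the block-based prefix search, Task 3)."""
    return [s[:k] for k in range(len(s), 0, -1) if s[:k] in D]

def decompose(s, dictionaries, i=0):
    """All ways to represent s as x1 ... xn with each xi taken from the
    i-th dictionary (Section 5.1)."""
    if i == len(dictionaries):
        return [[]] if s == "" else []   # accept only full coverage
    variants = []
    for x in prefix_search(dictionaries[i], s):
        for rest in decompose(s[len(x):], dictionaries, i + 1):
            variants.append([x] + rest)
    return variants
```

For example, decompose('hablábamos', [{'habl'}, {'ába'}, {'mos'}]) yields [['habl', 'ába', 'mos']].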

5.2         Approximate prefix search. Spelling correction

By error correction, we mean solving the decomposition task not for the given string s but for some other string s̃ that differs from s in a specific way. In other words, the task is to find the strings s̃ that differ from s in a specific way and that can be decomposed by the given set of dictionaries D_i.

Our algorithm does not heavily depend on the string similarity criterion being used. In our implementation of a morphological analyzer, we used the so-called single-letter error, which is defined as either mutilation of one letter, or omission of one letter, or insertion of one letter, or swapping of two neighboring letters. In addition, we took into account some other types of errors, such as transposition of two vowels around one consonant or of two consonants around one vowel, as in *comtupational.

In the rest of the article, we will rely on the possibility, for any given string s, position p, and letter l, of generating all strings similar to s according to the criterion being used, such that all of them have the letter l at the position p and coincide with s in the initial p − 1 letters. For example, for the string *comtupational, the letter “p,” position 4, and the criterion mentioned above, the following strings are generated: *compupational, *comptupational, and computational (the latter one by swapping two consonants around one vowel).
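Such a generator can be sketched in Python as follows (covering only the four single-letter operations; the vowel/consonant transpositions, which produce computational in the example above, are omitted for brevity):

```python
def hypotheses(s, p, letter):
    """Strings obtainable from s by one single-letter error that coincide
    with s in the first p-1 letters and have `letter` at position p
    (positions are 1-based, as in the paper)."""
    q = p - 1                                        # 0-based position
    out = set()
    if q < len(s) and s[q] != letter:
        out.add(s[:q] + letter + s[q + 1:])          # one letter mutilated
    out.add(s[:q] + letter + s[q:])                  # one letter omitted in s
    if q + 1 < len(s) and s[q + 1] == letter:
        out.add(s[:q] + s[q + 1:])                   # one letter inserted in s
        if s[q] != letter:
            out.add(s[:q] + s[q + 1] + s[q] + s[q + 2:])  # neighbors swapped
    return sorted(out)                               # alphabetical order
```

For s = *comtupational, p = 4, and the letter “p,” this yields *comptupational and *compupational, as in the example above.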

5.2.1           Minimization of the number of the hypotheses tried

Since we consider the similarity criterion as a generating procedure, our algorithm will form and try some hypotheses for the variants of the string s. There are several possible ways to measure the performance of an error correction algorithm. Let us consider the following task: to find all possible variants of the given string s that can be decomposed by the given set of dictionaries D_i, trying as few hypotheses as possible. We will divide the sketch of our algorithm into three parts, starting from a simple case and then adding more complications.

5.2.1.1           One dictionary: approximate search

First, let us consider a simpler task: to find exactly the variants of s present in only one dictionary D. Our algorithm takes advantage of the possibility, given the alphabetic place [s̃] of a variant string s̃, of predicting the next position and letter to be tried.

Let limit = 1 + the length of the maximal common initial substring of s and D[s] + 1.

For each position p starting from 1 up to limit do:

    Set letter to “a”.

    Repeat while letter is less than or equal to “z”:

        Form the set of hypotheses s̃ that have letter at the position p.

        Order this set alphabetically.

        For each hypothesis s̃ do:

            If D[s̃] = s̃ then report a possible variant.

            If this is the last hypothesis then

                Let a = D[s̃] + 1.

                If a(1, p − 1) ≠ s̃(1, p − 1) then break the loop and go to the next p.

                If a(p) ≠ s̃(p) then set letter to a(p), else increase letter by 1.

Fig. 1. Algorithm of error correction, version 1.


We try the positions p of the string s one by one, starting from p = 1; see Fig. 1. In an exhaustive search, at each position p the letters from a to z would be tried to form the hypotheses s̃ by our generative similarity criterion. However, in our algorithm not all letters have to be actually tried. For a hypothesis s̃, let us consider a = D[s̃] + 1: the key following the alphabetic place of the current hypothesis.[10] Let us denote by w(u, v) the substring of a string w starting at the position u and ending at v, and by w(u) the letter of w at the position u. We will count the positions from 1, so that w(1, v) is the initial substring of w of length v.

Lemma 7a. If a(1, p − 1) = s̃(1, p − 1) and a(p) ≠ s̃(p), then there is no x ∈ D such that x(1, p − 1) = s̃(1, p − 1) and s̃(p) < x(p) < a(p).

Proof: By definition of the alphabetic order, for such a hypothetical x, s̃ < x < a = D[s̃] + 1, i.e., s̃ < x ≤ D[s̃]; this implies s̃ < D[s̃], which is impossible by the definition of [s̃]. □

Thus, in the case of a(1, p − 1) = s̃(1, p − 1), there is no reason to try all the letters up to a(p), so the next set of hypotheses s̃ is formed with s̃(p) = a(p). This significantly decreases the number of hypotheses to be tried.

Lemma 7b. If a(1, p − 1) ≠ s̃(1, p − 1), then there is no x ∈ D such that x(1, p − 1) = s̃(1, p − 1) and s̃(p) < x(p).

Proof: The same as for Lemma 7a, since again s̃ < x < a. □

In this case, the entire rest of the alphabet can be skipped. Finally, if a(1, p) = s̃(1, p), there is no information allowing any letters to be skipped. For best results, the hypotheses generated by the similarity mechanism for a particular position and letter are to be tried in alphabetical order. In this case, the hypothesis that satisfies the conditions of the two lemmas is, with a greater probability, the last one; this simplifies the algorithm.

When does the process stop? So far, we have considered s unlimited, so there is no last position in it. This problem is easy to solve. The last position p to try is one plus the length of the maximal initial part common to s and D[s] + 1. By Lemma 7a, there are no strings in D having a longer initial part in common with s.
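Under the same assumptions as before (a sorted in-memory key list D, the hypotheses generator sketched above, and a plain a–z alphabet), the algorithm of Fig. 1 can be rendered in Python roughly as follows:

```python
from bisect import bisect_right

def lcp_len(a, b):
    """Length of the maximal common initial substring of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def correct(s, D):
    """Report all single-error variants of s present in D, skipping the
    letters ruled out by Lemmas 7a and 7b (Fig. 1)."""
    variants = []
    i = bisect_right(D, s)                    # index of the key D[s] + 1
    limit = 1 + (lcp_len(s, D[i]) if i < len(D) else 0)
    for p in range(1, limit + 1):
        letter = 'a'
        while letter <= 'z':
            hyps = hypotheses(s, p, letter)   # alphabetically ordered
            if not hyps:
                letter = chr(ord(letter) + 1)
                continue
            for h in hyps:
                j = bisect_right(D, h)
                if j > 0 and D[j - 1] == h:
                    variants.append(h)        # D[h] = h: the hypothesis is in D
            a = D[j] if j < len(D) else None  # key after the last hypothesis
            if a is None or a[:p - 1] != h[:p - 1]:
                break                         # Lemma 7b: go to the next p
            if a[p - 1] != letter:
                letter = a[p - 1]             # Lemma 7a: jump over useless letters
            else:
                letter = chr(ord(letter) + 1)
    return variants
```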

5.2.1.2           Many dictionaries: approximate decomposition

The described procedure works with only one dictionary. Now we can return to the case of decomposition of a string by several dictionaries D_1, …, D_n. We will not discuss the algorithm in detail, but the idea is as follows. If in a dictionary D_i the next key a is an initial substring of the corresponding rest of s̃, and therefore the conditions of neither lemma above are satisfied, then the decomposition algorithm normally proceeds to the next dictionary D_{i+1}, etc., until some difference between a = D[s̃] + 1 and the corresponding rest of s̃ is found. The position p is considered relative to the whole string s rather than to the current piece; the rest of our algorithm remains the same. Effectively, a Cartesian product D_1 × … × D_n is considered as a single dictionary D, and the algorithm described above is applied to it.

5.2.1.3           Block boundaries

Finally, let us discuss one more complication. As we proposed in a previous section, for locality of access, each of the dictionaries D_i is stored in the form of blocks B̄, such that only one such block is fetched from the main storage device when the dictionary is searched. What should be done, then, if the next key a = D[s̃] + 1 is not located in this block, but in the next block? Fetching that next block is not desirable. Fortunately, it is not necessary.

Namely, the necessary information can be found in the index Ĩ, or, more precisely, in the index of some level t. Indeed, if D[s̃] is located at the boundary of the blocks, then the initial part of the key Ĩ[s̃] + 1, of just the necessary length,[11] is equal to that of D[s̃] + 1. If, in its turn, Ĩ[s̃] is located at the boundary of the blocks in Ĩ, then Ĩ²[s̃] + 1 can be used, etc., up to the last-level index, which has no block boundaries. Note that neither Fig. 1 nor Fig. 2 below reflects the complications just discussed.

Thus, we have discussed an algorithm of decomposition of a string by a set of dictionaries with error correction. The number of hypotheses tried is substantially smaller than in exhaustive search. For each hypothesis, only one block is fetched from the very first dictionary (which in natural language analysis is the very large stem dictionary). For the other dictionaries D_2, …, D_n, only one block is accessed at each corresponding step of backtracking in the decomposition process.

5.2.2           Minimization of the median number of storage accesses

Now let us discuss a different parameter to be minimized. Suppose we have a procedure (or just a human user) that, for each hypothesis, can decide whether a corrected string s̃ for the string s is to be accepted and the search process is to be stopped. We will minimize the working time of the algorithm until it finds the correct variant.

Set string a = D[s] + 1.

Set number start = 1 + the length of the maximal common initial substring of s and a.

For each position p starting from start down to 1 do:

    Set letter to a(p).

    Repeat while letter ≠ s(p):

        Form the set of hypotheses s̃ that have letter at the position p.

        Order this set alphabetically.

        For each hypothesis s̃ do:

            If D[s̃] = s̃ then report a possible variant.

            If this is the last hypothesis then

                Set a = D[s̃] + 1.

                If a(1, p − 1) ≠ s̃(1, p − 1) then reset letter to “a”.

                If a(p) ≠ s̃(p) then set letter to a(p), else increase letter by 1.

Fig. 2. Algorithm of error correction, version 2.


We do not make any assumptions about the nature of the mechanism that causes the errors in the strings s and thus about the nature of the procedure that tests the variants. Instead, we will just organize the search in such a way that most of the variants are found at the very early stage of its work, and the most of the work that is scarcely result in new variants is done at the later stage.

As a mathematical measure of the grouping of the variants, we chose the median: the time by which half of the total number of variants has been generated. For example, let the first 10 variants of error correction of s be generated in the first second and the other 10 variants in the following 5 minutes. Then the total working time of the algorithm is 301 seconds and the average time per variant is 15.05 seconds, but the median time is 1 second. This is the behavior we want, in comparison with an algorithm that produces a variant every 15.05 seconds, the median time of such an algorithm being 150.5 seconds. In fact, the algorithm described in the previous section has the latter type of median time.

In accordance with the data locality principle, we suppose that the most time-consuming operation is fetching a block from the dictionary. Thus, what we want to minimize is the number of blocks fetched until the correct variant is found. For this, we should first examine the block in which the most variants are located. Fortunately, with our dictionary structure, most of them are located in the block B̄ containing D[s]. What is more, this block probably is already fetched into memory at the moment of error correction, because, to find out that error correction is necessary, the string s was first searched for (and not found) in D.

Only a minor correction to the algorithm described in the previous section is necessary; see Fig. 2. To look in B̄ first, we start trying the hypotheses not from the first position p = 1, but from the last position in s, and advance in decreasing order. More precisely, since we consider s unlimited, the first position p to try is the first position at which s differs from the string at the position D[s] + 1 in D, as discussed in the previous section (there it was the last position to try).

For even better locality, the letters at each position p are tried not in the order from a to z, but from s(p) + 1 to z and then from a to s(p) − 1. This increases a little the chances of finding the correct variant in the same block where D[s] is located.
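For a plain a–z alphabet, the rotated letter order can be written as a short helper (illustrative):

```python
import string

def letter_order(c):
    """Letters to try at a position whose original letter is c: from c+1
    to 'z', then from 'a' to c-1 (Section 5.2.2)."""
    k = string.ascii_lowercase.index(c)
    return string.ascii_lowercase[k + 1:] + string.ascii_lowercase[:k]
```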

With this modification of the algorithm described in the previous section, practically without any increase in the number of the hypotheses tried, the median number of the blocks accessed during the search decreases dramatically.

5.2.3           Experimental results

We have implemented the discussed algorithm for morphological analysis of natural language in AMORE, a system for morphological analysis and synthesis (Gel'bukh, 1991; 1992). It has an 8-year history of commercial use with Russian (Gelbukh, 1995) and some other dictionaries, such as English and Spanish.

Russian is a highly inflective language with a multiple-morpheme word structure and many fusion effects (internal sandhi) of the kind discussed above, such as prevozmo-ch’ – prevozmog-u – prevozmozh-esh’.[12] Our dictionary consisted of 100 thousand lexemes, which resulted in about 200 thousand dictionary entries (records). Each record consisted of a compressed stem and the corresponding morphological information. In decomposition, we used one general stem list and up to four (for verbs) short lists of endings. The main stem list was indexed with three levels of indices, as described in Section 3.1.

A typical result of error correction is as follows: for the mutilated form *prevosmozhesh’, the complete search process took 5 seconds, while the only variant, prevozmozhesh’, was found in 0.05 seconds.

6          Conclusions

A data structure that guarantees one memory block access per prefix search query with a very large dictionary was introduced.

The following tasks have been discussed:

1.        Prefix search in a very large dictionary: finding all records in the database whose keys are initial substrings of a given unlimited string s,

2.        Given the unlimited text string s, separating its leftmost “word” w, defined as an initial substring of s that can be decomposed into a concatenation of substrings w = x_1 … x_n, x_i ∈ D_i, for a set of dictionaries D_1, …, D_n,

3.        Actually decomposing w into a concatenation of elements (allomorphs) belonging to D_1, …, D_n,

4.        Approximate search and decomposition: solving the above tasks approximately, for a given criterion of similarity between strings.

These tasks constitute the main part of the morphological analysis and spelling correction algorithms. Since with our algorithm the leftmost word (stem) is found automatically, a stream of letters without boundaries between words, like a Japanese text or a DNA structure, can thus be decomposed into words.

Our variants of the alphabetic search and spelling correction algorithms are optimized for the data locality requirement that arises when data are accessed in chunks (blocks: disk sectors, memory pages, or network packages) and the cost (time) of data access is proportional to the number of chunks accessed rather than to the number of bytes used. In particular, this is the situation with virtual memory under widely used operating systems such as Windows and UNIX.

The structure of the dictionary was discussed, which is similar to a 2-level B-tree. At the cost of a slight redundancy of storage, it is guaranteed that only one block of the dictionary is fetched per query, in spite of the inherent non-locality of the task.

As an illustration of the practical usefulness of the proposed structure, algorithms for building, exporting, searching, and searching with spelling correction were sketched.

The described duplication of a very small percentage of records, for the sake of a guaranteed one memory block access per prefix search query, is a novel technique not previously discussed in the theory of B-trees.

References

Aho, Alfred V. “Algorithms for finding patterns in strings”, J. van Leeuwen, editor, Handbook of Theoretical Computer Science, chapter 5, pp. 254-300. Elsevier Science Publishers B. V., 1990.

Frakes, W., and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.

Bayer, R., and K. Unterauer. “Prefix B-Trees”, ACM Trans. Database Systems 2, pp. 11-26, 1977.

Cassidy, P. “An Investigation of the Semantic Relations in the Roget’s Thesaurus: Preliminary Results”, A. Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, Fondo de Cultura Económica, Mexico, to appear in 2001. See also Proc. of CICLing-2000, February 13 to 19, 2000, CIC-IPN, Mexico City, ISBN 970-18-4206-5.

Comer, Douglas. “The Ubiquitous B-Tree”, Computing Surveys 11 (2): 121-137, 1979.

Cooper, W.S. “The storage problem”, Mech. Translat., 1958, p. 74-83.

Diccionario. Diccionario de la lengua española. Real academia española, vigésima primera edición, 1992.

Fellbaum, Ch. (ed.) WordNet as Electronic Lexical Database. MIT Press, 1998.

Gelbukh, A.F. An efficiently implementable model of morphology of an inflective language. Ph.D. thesis, VINITI, Moscow, Russia, 1995.

Gel'bukh, A.F. “Effective implementation of morphology model for an inflectional natural language”, Automatic Documentation and Mathematical Linguistics, Allerton Press, vol. 26, N 1, 1992, pp. 22-31.

Gel'bukh, A.F. “Minimizing the number of memory accesses in dictionary morphologic analysis”, Automatic Documentation and Mathematical Linguistics, Allerton Press, vol. 25, N 3, 1991, pp. 40-45.

Gusfield, Dan. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997; ISBN: 0521585198.

Hausser, Ronald. Foundations of Computational Linguistics. Man-Machine Communication in Natural Language. Springer-Verlag, 1999.

Johnson, Theodore, and Dennis Shasha. “B-Trees with Inserts and Deletes: Why Free-at-Empty Is Better Than Merge-at-Half”, JCSS 47 (1): 45-76 (1993). See other publications at http://www.informatik.uni-trier.de/~ley/db/access/btree.html.

Knuth, Donald. The Art of Computer Programming: Sorting and Searching, Vol 3, 2nd Ed, Addison-Wesley, 1998.

Koskenniemi, Kimmo. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production, University of Helsinki, Department of General Linguistics, Publications, N 11, 1983, 160 pp.

Lenat, D. B. and R. V. Guha. Building Large Knowledge Based Systems. Reading, Massachusetts: Addison Wesley, 1990. See also more recent publications on CYC project, http://www.cyc.com.

Yuret, Deniz. Discovery of linguistic relations using lexical attraction. Ph.D. thesis, MIT, 1998. See http://xxx.lanl.gov/abs/cmp-lg/9805009.

Zaliznyak, A. A. Grammatical dictionary of Russian. Word formation (in Russian). Russkij Jazyk, Moscow, Russia, 1987, 878 pp.


 

 

 

 

 

Alexander Gelbukh was born in Moscow in 1962. He obtained his Master’s degree in Mathematics in 1990 from the department of Mechanics and Mathematics of the Moscow State “Lomonossov” University, Russia, and his Ph.D. degree in Computer Science in 1995 from the All-Russian Institute of Scientific and Technical Information (VINITI), Russia. Since 1997 he has been the head of the Natural Language and Text Processing Laboratory of the Computing Research Center, National Polytechnic Institute, Mexico City. He is the author of about 100 publications on computational linguistics. See http://www.Gelbukh.com.

 

 


 



*   A revised version of the paper A Data Structure for Prefix Search under Access Locality Requirements and its Application to Spelling Correction, Proc. of MICAI-2000, Mexican International Conference on Artificial Intelligence, April 10-14, 2000, Acapulco, Mexico, ISBN 970-18-4465-3. The paper was selected for publication in the journal as one of the best papers of MICAI-2000. Work done under partial support of REDII, CONACyT and SNI, Mexico.

[1]   The hyphen at the end of the stem is shown just for readability and is actually not a part of the stem.

[2]  The meaning of the Spanish wordforms is not important for the discussion, so we do not give any glosses.

[3]  We will explain below how we deal with letter alternations such as accents.

[4]   Though switching from Kana letters to Kanji hieroglyphs in many cases indicates a word boundary, the problem of segmenting the text flow into words is still more difficult in Japanese than in, say, English.

[5]   English examples: in order to, each other, New York; Russian example: vo chto by to ni stalo ‘by all means’.

[6]   Here we do not discuss the CPU time of analysis, since it depends solely on the speed of the CPU and not on the data storage device.

[7]  By the alphabetic place of a given string s in the alphabetically ordered list D, we mean the position at which s could be inserted into D; see Definition 2 below.

[8]   This notation is motivated by the C++ programming language, where an array D of strings can be indexed with a string s, which is denoted as D[s].

[9] ‘To overcome,’ ‘I will overcome,’ ‘you will overcome.’

[10] To avoid too complicated notation, we use interchangeably the index of a string in D and the string itself when this does not cause any confusion.

[11] This can be easily proved by analysis of the definition of the indices Ĩ.

[12]  We do not provide any glosses since the meaning of the words is irrelevant for our discussion.