APPROXIMATE INITIAL SUBSTRING SEARCH

Abstract. A data structure useful for finding all records in a database whose keys are initial substrings (of unknown a priori length) of a given unlimited string is discussed. This problem is important for morphological analysis of inflective languages, including particularly difficult cases such as German word concatenation or Japanese writing system that does not use spaces between words. The data structure is optimized for locality of access: to find all necessary records, access to only one block (page) of the main data storage is guaranteed. To illustrate its usefulness, the algorithms of formation of the structure, search, and approximate matching search (spelling correction) are described.

Keywords: approximate string matching, initial substring matching, morphological analysis, spelling correction, natural language processing.

1 Introduction^*

Typically, a database management system is a device that can answer a simple question: Which records have the key exactly equal to the given string? However, in some cases we are interested in another question: Which records have the key that is an initial substring (of unknown a priori length) of the given string?As we will see, this problem has some specificity and raises some technical issues that the classical task of exact key matching does not face, specifically, the problems related to the data access locality.

In this article, we will discuss these issues, propose a data structure optimized with respect to them, and illustrate its usefulness by a sketch of the algorithms of dictionary building and searching. As an advanced use of the proposed structure, we will discuss its use for the task of spelling correction, i.e., approximate matching search.

1.1 Initial substring query

Our main motivation was morphological analysis of words in highly inflective languages such as Spanish, French, or Russian. In natural language morphology, an example of the task is the following. Let us consider a dictionary (a database) of Spanish stems^[1] like the one shown in Table 1.

Table 1. A fragment of Spanish dictionary.
Key	Value
.....	.....	.....
clar-	adjective	claro
co-	prefix	co-
com-	verb	comer
con-	preposition	con
concentr-	verb	concentrar
const-	verb	constar
constancia-	noun	constancia
constante-	adjective	constante
constat-	verb	constatar
constelación-	noun	constelación
constipad-	noun	constipado, -a
constru-	verb	construir
construcción-	noun	construcción
constructiv-	adjective	constructivo
constructivismo-	noun	constructivismo
consult-	verb	consultar
.....	.....	.....

In fact, the dictionary is much denser [7]; we have omitted some lines for brevity. Spanish words have highly inflected forms. For example, the stem^[2]constipad- gives rise to the wordforms constipado, constipada, constipados, constipadas; the stem clar- to the wordforms claro, clara, claros, claras, clarísimo, clarísima, clarísimos, clarísimas, claramente; the stemconstru- to about 700 wordforms such as construir, construyo, construido, constrúyanmelo^[3], etc.

Thus, a natural task arises: to find in the dictionary the records whose keys are initial substrings of, say, the string constructivamente.

By records, we mean all records and not just the longest one. Indeed, here is an example where the longest key is not the right one. Suppose the dictionary contains such stems as ajen- for derecho ajeno, moren- for color moreno, escalen- for triángulo escaleno, etc. Then, for the verb forms like que ellos ajen algo, moren, escalen algo, the stems mentioned above are the longest ones, while the shorter stems aj-ar, mor-ar, escal-ar are the correct ones. Since making any linguistically meaningful decisions is not the business of the string storage mechanism, we conclude that the dictionary search procedure should enumerate all the records it found, starting from the longest one since it is with higher probability the right one.

Though the examples of stem ambiguity are rather rare in Spanish, they are much more frequent in languages with more developed morphological inflection, such as Russian, Finnish, Turkish, just to mention a few. In German, compound words are very frequent, so that one word can consist of many stems: for example, the first stem of kommunikationstechnik is kommunikation-. The extreme situation is, say, in Japan writing system which just does not use spaces between words. In this case, the query string for the discussed database engine to find the stems in is just the whole text, since there is no a priori clue on where the word ends.

There is another good reason to consider the whole unlimited text as the query string to search the stems in, even in the languages that do mark word boundaries. In all such languages, the word delimiters – spaces for simplicity – are sometimes overused. For example, a Spanish preposition a través de contains two spaces and thus looks like three words, though there are no technical reasons to treat it so. Thus, even in Spanish, like in Japanese or German, we can not easily detect the word boundaries! Fortunately, in order to use the search mechanism discussed in this paper, we really do not need to: it is enough to include the word a través de in the dictionary, with its both spaces, as one record key.

Thus, if we consider the task of finding in the database the initial substrings of a given unlimited text, we do not need to split the text a priori into individual words.

Related work. There is a vast literature on the prefix-matching data structures, such as tries, B-Trees, Prefix B-Trees, B+-Trees, extendible hashing, etc. [1, 5, 12]. The theory of such structures is developed in the context of the database index management rather than on dictionary application; thus, these structures and the corresponding algorithms are optimized with respect to efficient updating (insertions and deletions), which is rather irrelevant for our case. On the other hand, these structures are not optimized for the access locality (see the next section) that is the main topic of our paper. Actually, what we propose is similar to a 1-level B-tree (that we then extend to an n-level B-Tree) with the additional feature of repetition of some keys, which optimizes the structure with respect to data access locality.

In our paper, we will illustrate the usefulness of the suggested structure for spelling correction. The literature on approximate string matching is also huge [1, 2, 12]. We will consider a kind of a nearest edit distance spelling correction, with a set of editing operations specific to typographic errors in texts. However, our discussion of spelling correction is not aimed on proposing of a new approximate string-matching algorithm per se. Instead, we will describe a fast spelling correction algorithm that works with the same data structure that is used for exact string matching within an integrated morphological analysis system, without any modifications of the structure that would slow down the normal (exact) operation nor require additional storage space.

1.2 The data locality problem

Data locality means access to a small region of data storage per operation. Suppose you need to retrieve the name, surname, and age of a person from a file. If all you need to do is to read a line 1234 where this data is stored then this operation is local. On the other hand, if you need to read the name from the line 678, surname from 901, and age from 2345 then this way of data retrieval is not local since you have to address several different regions of data storage to fulfill one operation.

With most of data storage devices, local access is much faster than non-local access is. The most obvious example is a tape where the access time is proportional to the distance between the addressed locations. Disk storage, be that a hard disk or a CD, is less sensible to data locality but still sensible: in this case, the access time is proportional to the number of addressed locations, so that in the example above, the second variant is trice slower than the first one. This is because the data are exchanged with a disk storage device by blocks, or sectors, of fixed size. Reading, say, a kilobyte of sequential data takes nearly the same time as reading one byte, but repositioning the reading head to another location takes significant time.

It might be argued that large amount of available memory renders the data locality problem unimportant since the cost of addressing to the random access memory (RAM) is proportional just to the number of retrieved bytes regardless to their location. However, this is not completely true. Under the operating systems widely used nowadays, like Windows 98 or NT, the large (virtual) memory is much “less random access” than in old good times of DOS. Really, most of the physical memory is occupied by the active concurrent programs and the operating system itself, while random parts of any data loaded in memory are swapped out to the hard disk. Thus, under such systems, loading a dictionary into memory makes little difference from storing it on the disk. Simple experiments show that data locality problem keeps its importance even for data structures that “fit in memory” (while really is highly probably swapped out of it). On Intel processors, the quantum – so called page – of memory is four kilobytes; while sequential accesses within one page are local, access to a different page with some probability causes hard disk read and write operations.

The data locality problem is especially important for natural language analysis applications that use huge amount of information for the analysis of each phrase. While each individual dictionary – a morphological dictionary, syntactic dictionaries such as subcategorization and lexical attraction (collocation) [15] dictionaries, a semantic network dictionary [12, 1], world knowledge dictionaries [12], etc. – might fit in the physical memory, all of them together will not. Meanwhile, the processing of each phrase requires using of all of these resources in parallel, before the processing of the next phrase begins, and with tens of thousands of different words occurring in random order. Thus, natural language processing tasks are unique in their highly intensive memory using that causes intensive virtual memory disk swapping.

However, only morphological analysis faces the problem of data locality. All other dictionaries – syntactic, semantic, and world knowledge ones – can be stored in a classical database.^[4] Thus, not more than one data storage access per word is necessary for each of the dictionaries used but the morphological one. Therefore, the data locality problems with which the morphological analysis faces is the bottleneck of the data access during natural language analysis.

Returning to the search algorithms, let us consider an example of a data locality-friendly one and an example of one that is not.

· The classical hash table is data-local. To find the record with the key construcción, its hash value is calculated, which defines the address starting from which (at this address or at some near one in case of collision) the record is located. However, hash table is good for the classical database task of finding the exact keys, but it does not work for our task of initial substring query: to find the records whose keys are initial substrings of the given unlimited string, a potentially infinite number of hypotheses are to be tried.

· Binary search in an alphabetically ordered array works fine for the initial substring query task. The main search operation in this case is that of finding the alphabetical place of a given string in the list^[5]. However, this structure is not local: for a binary search in, say, a 1000 lines file, one needs to address 10 different locations in it.

With binary search, the problem is not only in a wrong algorithm. Initial substring query task for an alphabetically ordered list is inherently non-local. Indeed, let us consider the search procedure for the Spanish verb form consto. Its alphabetical place in the dictionary in Table 1 is after constelación-, while the true answer is const-, and what is more, according to the way we formulated the task, the records con-, co-, andc- are also to be retrieved. In [7], the distances between these locations are as follows: c- ¼9327 words¼co- ¼2101 words¼con- ¼1445 words¼const- ¼95 words¼constituyentes- (consto). Since these are the results of the query and thus any algorithm is to address these locations, no algorithm based on this data structure can be local.

In this article, we will discuss a data locality-friendly modification of alphabet search. Namely, by a slight redundancy, our data structure guarantees that after retrieval of exactly one block (page) of data from the main storage, all the records with the keys being initial substrings of the given string are found.

2 Dictionary structure

Hereafter we will usually speak of the keys or the records that have these keys interchangeably when this does not create any confusion. For example, we will say “list of keys” while meaning “list of records having these keys.”

Let us start from an alphabetically ordered list of keys D since this is a natural data structure for our task of finding initial substrings. As we have seen, there are two locality problems with this list. First, even if we only are interested in finding the alphabetical place of a string s in D,binary search is not a local algorithm. Second, non-locality is an inherent property of such a list if we are interested in finding the initial substrings of s, especially all of them rather than only one. We will consider these two problems separately.

2.1 Algorithm locality: index of blocks

Definition1. Alphabetical place D [s] of a string s in D is the number of the records such that r £ s, i.e., the position of the key immediately before the place that s had if it were inserted in D. In an appropriate context, we will understand by D [s] the key itself rather than its position.

We can consider D to be split into segments, or blocks, corresponding to the physical storage units such as disk sectors, memory pages, network packages, etc.:^[6] . Let us consider the task of finding the alphabetical place D [s] of a given string s.

Binary search is not a local algorithm. However, it can be easily improved using an index of the blocks.

Let us denote the first key of the block as , and the last key as , so that the block is a segment (in the sense analogous to the geometrical one) = [, ] Í D.

Let us consider a set of keys , where is the first key in .

Theorem1. The alphabetical place of a string s in D is located in the block with the number I [s]: .

Proof: . à

Definition2. By x y, we denote that x is an initial substring of y.

Thus, the necessary block can be found by means of search in the index I. We consider I to be small enough so that the data locality problem is not applicable to it. What is more, it can be made even smaller by removing redundant letters of some keys that do not change the alphabetical place of any string. Namely, let us replace each element by its minimal initial substring that is not an initial substring of the last key of the previous block : , , and is the minimal with this property. Let .

Lemma 1. For any x, y, z, the following conditions are incompatible: , , , and .

Proof: Suppose the first three conditions hold. Since and , then . Since , , and , then by definition of the lexicographic order, . à

Theorem2. The alphabetical place of a string s in I is the same as in : .

Proof: Let i = , j = . First, since and , then . Second, since , , and , then, by Lemma 1, . Thus, < , that means that i = = j. à

Thus, we can use an even smaller index to find the necessary block. First, the place of s is found in , then the block with the corresponding number is retrieved, and then s is found in by any suitable method. Ignoring the problem of locality of search in a small list , we can say that this algorithm of finding the alphabetical place of s in D is local.

In practical implementation, for simplicity we implemented the search for s in by using a second level index: we split into blocks and compiled an index of their first keys. We applied this procedure recursively, splitting that second level index into blocks and compiling a third level index , which in our case consisted in only one block and thus did not need in any index.

2.2 Data locality: block prefix

Though the problem of finding is local, the problem of finding an , is not. First, x can be located far from . Second, if we are interested in finding all such keys, they can be located quite far from each other, as we have seen with the keys co-, con-, const-, etc. Such keys are probably located in different blocks .

What we propose is to add to each block all keys that can be necessary for any query addressing this block. As we will show, the changes will result only in appending some small amount of information at the beginning of the block. We call this additional piece of information a block prefix.

Namely, for any s such that , for any such that , we propose to duplicate the key x with its corresponding record to the block B. Probably some records of D will be moved to other blocks since the size of the block is fixed; thus this operation can cause changing the way D is split into blocks and increase the number of blocks. However, with this operation we reach our goal: by retrieval of only one block , namely, such that , we find in it, locally, all such keys that .

Does this cause a significant increase of the dictionary size? Does this result in complicated algorithms of search and of dictionary building? Let us examine closer the set of keys that are to be added to B and the way D is now split into .

Definition 3. A set of strings S Í D is closed if "x Î S and "y Î D, if then y Î S.

Definition 4. The closure of a set S Í D is the minimal closed subset of D that contains S.

Lemma 2. Definition 4 is correct, i.e., there exists exactly one minimal closed subset of D that contains S, namely, = {y Î D | $x Î S such that }.

Proof: Obviously, this set is closed and contains S, since for any x, . On the other hand, it is contained in any other set with these two properties. à

Now we can formalize our solution. Let us split D into a set of segments (which are not necessary the same that we discussed before), such that each closure , rather than the block itself, has the size corresponding the physical storage unit. All the statements of the previous sections hold for this segmentation of D.

Let this segmentation of D be minimal, i.e., splits D into a minimal possible number of blocks. We propose to use as our main data structure (instead of the original set of blocks ). Now we will show that this structure is only insignificantly larger than the original set D: S || » |D|.

Lemma 3. For any s and any , if , then .

Proof: By definition of , if and , then . If x = , then , else, by Lemma 1, . à

Lemma 4. For any block B Í D and any, the following conditions are equivalent:

· such that and ,

· such that .

Proof: By Lemma 3. à

Thus, the set of keys to be added to a block B is exactly the set of all initial substrings , of all keys . What is more, the keys to be actually added are just initial substrings of the very first key since the other ones are already in B. By b < a we denote the alphabetical order.

Lemma 5. Let b be the first key of a block Í D, x Î D, . Then if such that , then .

Proof: Since x Î D, , and x Ï B, then x < b. Since b £ a and , then, by Lemma 1, . à

Thus, finally, the only keys to be duplicated into the block in order to guarantee access locality are the initial substrings of the very first record of the block:

Theorem 3. For any segment S = [b, e] Í D, = S È{y Î D | , }.

Proof: By Lemma 5. à

Thus, each closed block is obtained from the correspondent block by adding before its first record all records from D whose keys are initial substrings of only one, namely the first one, key of . Clearly, there are not many such records in D.

For example, for a block of Spanish dictionary starting with the stem constat- for constatar, only the records with the keys const-, con- and co- are duplicated into the block, no matter how large it is. In our experiments with morphological dictionaries, there are about 2 to 3 records in average added to each block. In our case, 1-kilobyte blocks contained about 30 records in average, so the redundancy was about 10%.

With such a small redundancy, we guarantee the complete locality of data: any initial substring query is processed with retrieving of exactly one block of the main data storage.

3 Building, exporting, searching

In spite of simplicity of the formula for the duplicated records, its application sounds like vice circle. Really, the duplicated records depend on the first records of the blocks, while the block layout itself depends on such duplication, since now it is the size of that is fixed, not that of . However, we propose simple algorithms for building and exporting such a data structure.

3.1 Building

Suppose we have an alphabetically ordered list D of records and need to compile the dictionary in the block format discussed above: to split D into a minimal set of blocks so that and for each block, the size of its closure , where C is a constant, say, 4 kilobytes.

Let us first consider an auxiliary data structure that we call a stack of initial substrings. This is not exactly a LIFO stack, but is similar to it. The stack of initial substrings S has an underlying LIFO stack of records of D, and implements one operation, a kind of push, with the following behavior. When a record with the key s is pushed onto S, then the records are popped out of until the record with a key is found on the top of or is empty. Then, the new record is pushed onto .

Let the records of the ordered list D are read one by one from D and pushed onto S. At the beginning, S is empty.

Theorem4. At each step of this process, after is processed, S is ordered alphabetically from bottom to top and contains exactly all records whose keys are initial substrings of s: S = { | }.

Proof: First, since the new records are pushed onto (after some pop operations), then for each record, only the records that came before it are deeper than it in . Since the records come in alphabetical order, is ordered alphabetically. Second, when a record is pushed onto , the key immediately below it is its initial substring; then, contains a chain of initial substrings; in particular, Þ . On the other hand, if and , then , so that x have come at some step and was pushed onto . However, at no step was it popped out of . Really, if it were popped out by some record y that came after x, then x £ y, , y £ s, and , which is impossible by Lemma 1. à

The algorithm of formation of the dictionary works as follows.

· At the beginning of work, an empty block is created.

· During the work cycle, the records are read one by one and pushed onto S. After that, an attempt is made to append s to the block being currently formed. If there is any block compression as discussed below, it is performed. The new potential size of the block is estimated. If still , then the process repeats with the next record of D.

· However, if with the new record the size would exceed C, then the block , without the new record s, is ready. It is appended to the dictionary structure being formed, and a new empty block is created. The contents of the stack S is copied to from bottom to top, thus making the block closed. With this, s, which is currently on the top of S, becomes the first record of the new “logical” (not closed) block , and contains all the records whose keys are initial substrings of s and is ordered alphabetically. After that, the process repeats with the next record of D.

In our implementation, at the moment of formation of the new block , we dump the key of its first record to compile later the index . The key is truncated according to the definition of (see Theorem 2) since the last record of the previous block is known at the moment of formation of .

Also, with each block we kept a header containing only two numbers: (1) the total number of the records in the block and (2) the size of the block prefix, i.e., the number of the redundant records duplicated into the block, which is at the moment of creation of the new block .

We did not mention here a special treatment for the first and last blocks. Another complication is the special case of records with equal keys. This case can be ignored if we prohibit equal keys in D merging the values of such records when necessary. Alternatively, special precautions are to be taken for a group of records with equal keys not to be split across block boundary.

The described algorithm is single-pass.

3.2 Exporting and updating

Now let us consider the reverse task: from the block structure , restore the ordered list of records D. This task is rather trivial. In our program, each block contains a header with the number of the redundant records in it, so that it is enough to dump out the contents of each block, from first to last record, ignoring the first records of each block. Even if the number were not kept with the block, it would be easily restored by looking for the first record such that a > e where e is the last record of . This procedure is also a one-pass algorithm.

We do not discuss here any sophisticated algorithm of updating the block dictionary structure by adding, deleting, or changing a few records. In the context of database indices management, there is much literature on updating similar structures such as B-Trees and their variants [12]. However, unlike the indices of the databases, the morphological dictionaries do not change frequently, so that we do not need to develop efficient algorithms for updating the block structure.

The simplest way of updating the structure is to export it into a sorted list D, then merge in, remove, or change some records, and then create a new block structure out of the updated list. Since both the exporting and building algorithms are fast and one-pass, this way can be quite affordable.

There is a practical alternative to complete restructuring of the block dictionary in case of minor changes. The idea is in leaving some free space in the blocks while creating the structure. Then adding new records will result in reconstructing only a limited number of blocks. The possible need in duplicating the new records into some blocks complicates things, though in the case of minor changes, the number of changing blocks still remains much less than the total size of the dictionary. Here we do not discuss this algorithm, see [9].

3.3 Search and block compression

We have already discussed the search algorithm. Let us consider a string s; we want to find all such that .

Suppose we already have an algorithm A to find the alphabetical place [s] in a given block. We start from the highest level of index . With A, we find the number , and retrieve the block with this number from . Then we repeat the search in this block, extract the corresponding block from , etc., until , and finally retrieve from the main storage the block that contains D [s] and thus all such that .

The description of the algorithm A is not essential for the present article since it does not affect the issue of locality. However, we will discuss it briefly because of its importance. The algorithm depends on the internal representation and possible compression of data in the block.

Search in a compressed block. Generally, data compression is crucial for our algorithms because of the fixed size of the block. Really, as we have seen, additional overload to each block – the block prefix size – does not depend on the block size. Thus, the percentage of redundancy depends on the size of block. If the size of block prefix is comparable with the total block size, our algorithms are rendered unaffordable.

In our blocks, data is ordered alphabetically, so that the keys with common initial substring are located together. To compress such data, Cooper compression works well [6]: instead of an initial substring equal to that of the immediately previous key, only the number of common letters is indicated. For example, the keys in Table 1 are compressed in the following way: 0/clar-, 1/o-, 2/m-, 2/n-, 3/centr-, 3/st-, 5/ancia-, 7/te-, 6/t‑, 5/elación-, 5/ipad-, 5/ru-, 7/cción-, 8/tiv-, 11/ismo-, 4/ult-. The first key in the block always has the Cooper coefficient 0.

Such compression is especially useful for morphological dictionaries since in many languages there are variants of stems for the same word differing only in the last few letters. English examples are lady – ladi-es, stop – stopp-ed, Spanish traduc-ir – traduj-o – traduzc-a; indic-ar – indiqu-en; alcanz-ar – alcanc-en, Russian prevozmo-ch’ – prevozmog-u – prevozmozh-esh’^[7] [16].

Here we will not give any detailed comments on the search algorithm but only a short description; we again ignore the complications caused by the records with equal keys. The main personage of the algorithm is the common initial substring length e with the meaning of the length of the maximal common initial substring of the query string s and the current key being processed.

The keys are examined one by one. Initially, i = 0 and c = 0. Let the Cooper coefficient of the current key be c. If , the algorithm just proceeds to the next key . Otherwise, starting from the position c, the string s is compared with , and the new value for e is determined as the first position differing the two strings. During this comparison, two special situations can be found. If s < or the block is exhausted, the search process is over. If , one of the keys we are looking for is found. It is added to the set of the search results. In our implementation, the search results are stored in a LIFO stack, and each found key is pushed onto this stack. After the search is over, the results can be one by one popped out of the stack and examined; with this, they are examined in the order natural for most applications, i.e., from the longest to shortest.

Since the block contains the prefix with duplicated records, all the necessary records are found in this only one block.

Substring chains. The search method described above looks through all the records in the block until the alphabet place of the query string s is found. However, there are algorithms jumping directly to this position [s], for example, binary search. By storing some additional information in the dictionary, we can find all the initial substrings of s starting directly from [s]. For this, let us suppose that with each record in , a pointer to, or just the length of, the maximal such that is kept in the dictionary. Let be a chain of such maximal substrings.

Lemma 6. and , if , then .

Proof: By Lemma 3. à

Thus, to find all the initial substrings of s, the chain is to be looked through starting from [s] until an element found such that . Then the chain is the result of the query. For most applications, this chain is to be examined starting from x.

4 Decomposition and error correction

Decomposition and error correction tasks arise in nearly any search application of the type we consider. However, our main motivation here is again natural language analysis.

The role of this section is mostly illustrative since it shows how our dictionary can be used fop typical tasks of string processing. Because of the marginal role of this section and relative complexity of the algorithms, the description of the latter will be quite sketchy.

4.1 Decomposition of a string

Let us consider several string lists . By decomposition of a given string s, we mean finding such a set of substrings that their concatenation is s: . If this can be done in several different ways, all such sets are to be found. In natural language analysis this is called morphological analysis: Spanish habl-ába-mos, German kommunikation-s-technik. The specifics of this task for natural languages is in that at least (and usually exactly) one list of substrings is very large, namely, the stem dictionary.

We view the decomposition task as sequential application of the substring search algorithm discussed above. First, initial substrings of s are found in . For each of them, initial substrings of the rest of s are looked up in , etc. After the last dictionary is analyzed, only the variants of analysis are accepted covering the entire string s. Actually, in natural language analysis, the latter condition is slightly more complicated: the variants of analysis are accepted for which after the last piece, a symbol goes in the text that can be a word delimiter, such as a comma or period.

This method – analyzing the rest of the string s for each initial substring found in , ..., – effectively implements a kind of backtracking.

4.2 Error correction

By error correction we mean to solve the decomposition task not for the given string s but for some another string that differ from s in a specific way. To put it in other words, the task is to find such strings that differ from s in a specific way and that can be decomposed by the given set of dictionaries .

Our algorithm does not heavily depend on the string similarity criterion being used. In our implementation of a morphological analyzer, we used so-called single letter error, which is defined as either mutilation of one letter, or omission of one letter, or insertion of one letter, or swapping two neighboring letters. In addition, we took into account some other types of errors such as transposition of two vowels around one consonant or two consonants around one vowel, as in *comtupational.

In the rest of the article, we will rely on a possibility for any given values of a string s, position p, and letter l to generate all strings similar to s according to the criterion being used, such that all of them have the letter l on the position p and coincide with s by the initial letters. For example, for a string *comtupational, the letter “p,” position 4, and the criterion mentioned above, the following strings are generated: *compupational, *comptupational, and computational (the latter one by swapping two consonants around one vowel).

Minimization of the number of the hypotheses tried. Since we consider the similarity criterion as a generating procedure, our algorithm will form and try some hypotheses for the variants of the string s. There are several possible ways to measure the performance of an error correction algorithm. Let us consider the following task: to find all possible variants of the given string s that can be decomposed by the given set of dictionaries trying as few hypotheses as possible. We will divide the sketch of our algorithm here in three parts, starting from a simple case and then adding more complications.

One dictionary. First, let us consider a simpler task: to find exactly the variants of s in only one dictionary D. Our algorithm takes advantage of the possibility for alphabetical place of a variant string to predict the following position and letter to be tried.

Let limit = 1 + length of maximal common initial substring of s and + 1.

For each position p starting from 1 up to limit do:

Set letter to “a”.

Repeat until letter is less or equal to “z”:

Form a set of hypotheses that have the letter at the position p.

Order this set alphabetically.

For each hypothesis do

If = then report a possible variant.

If this is the last hypothesis then

Let = + 1.

If (1, p – 1) ¹ (1, p – 1) then break the loop and go to next p.

If (p) ¹ ( p) then set letter to (p), else increase letter by 1.

Fig. 1. Algorithm of error correction, version 1.

We try the positions p of the string s one by one, starting from p = 1, see Fig. 1. In the exhaustive search, at each position p the letters from a to z would be tried to form the hypotheses by our generative similarity criterion. However, in our algorithm not all letters have to be really tried. For a hypothesis , let us consider a = + 1: the key following the alphabetical place of the current hypothesis.^[8] Let us denote by w (u, v) the substring of a string w starting at the position u and ending at v, and w (u) the letter in w at the position u. We will count the positions from 1, so that w (1, v) is the initial substring of w with the length v.

Lemma7a. If a (1, p – 1) = (1, p – 1) and a (p) ¹ (p), then there is no such that x (1, p – 1) = (1, p – 1) and (p) < x (p) < a (p).

Proof: By definition of alphabetic order, for the hypothetical , < x < a = + 1, i.e., < x £ , that implies < , which is impossible by the definition of . à

Thus, in the case of a (1, p – 1) = (1, p – 1), there is no reason to try all the letters until a (p), so the next set of hypotheses is formed with (p) = a (p). This significantly decreases the number of hypotheses to be tried.

Lemma 7b. If a (1, p – 1) ¹ (1, p – 1), then there is no such that x (1, p – 1) = (1, p – 1) and (p) < x (p).

Proof: The same as for Lemma 7a, since < a. à

In this case, the entire rest of the alphabet can be skipped. Finally, if a (1, p) = (1, p), there is no information to skip any letters. For best results, the hypotheses generated by the similarity mechanism for a particular position and letter are to be tried in alphabetical order. In this case, the hypothesis that satisfies the conditions of the two lemmas is with a greater probability the last one; this simplifies the algorithm.

When the process stops? So far, we considered s unlimited, so there is no last position in it. This problem is easy to solve. The last position p to try is one plus the length of the maximal initial part common for s and + 1. By Lemma 7a, there are no strings in D having a larger initial part in common with s.

Many dictionaries. The described procedure works with only one dictionary. Now we can return to the case of decomposition of a string by several dictionaries .... We will not discuss the algorithm in detail, but the idea is as follows. If in a dictionary , the next key a is an initial substring of and therefore the conditions of no lemma above are satisfied, then the decomposition algorithm normally proceeds to the next dictionary , etc., until some difference between = + 1 and the corespondent rest of is found. The position p is considered relative to rather than to ; the rest of our algorithm remains the same. Effectively, a Cartesian product is considered as a single dictionary D, and the algorithm described above is applied to it.

Block boundaries. Finally, let us discuss one more complication. As we have proposed in a previous section, for locality of access, each of the dictionaries is stored in the form of blocks , such that only one such block is fetched from the main storage device when the dictionary is searched. What to do then if the next key = + 1 is not located in this block, but in the next block ? Fetching that next block is not desirable. Fortunately, it is not necessary.

Namely, the necessary information can be found in the index , or, more precisely, in the index of some level t. Really, if is located at the boundary of the blocks, then the initial part of + 1 of just the necessary length^[9] is equal to that of + 1. If, in its turn, is located at the boundary of the blocks in , then + 1 can be used, etc. until the last level index that has no block boundaries. Note that neither Fig. 1 nor Fig. 2 below reflect the complications just discussed.

Thus, we have discussed an algorithm of decomposition of a string by a set of dictionaries with error correction. The number of the hypotheses tried is substantially less than exhaustive search. For each hypothesis, only one block is fetched from the very first dictionary, which in natural language analysis is the very large stem dictionary). For other dictionaries ..., only one block is accessed at each corresponding step of backtracking in the decomposition process.

Minimization of the median number of storage accesses. Now let us discuss a different parameter being minimized. Suppose we have a procedure (or just a human user) that, for each hypothesis, can decide whether a corrected string for the string s is to be accepted and the search process is to be stopped. We will minimize the time of the work of the algorithm until it finds the correct variant.

Set string a = + 1.

Set number start = 1 + length of the maximal common initial substring of s and a.

For each position p starting from start down to 1do:

Set letter to a (p).

Repeat while letter ¹ s (p):

Form a set of hypotheses that have the letter at the position p.

Order this set alphabetically.

For each hypothesis do:

If = then report a possible variant.

If this is the last hypothesis then

Set = + 1.

If (1, p – 1) ¹ (1, p – 1) then reset letter to “a”.

If (p) ¹ (p) then set letter to (p), else increase letter by 1.

Fig. 2. Algorithm of error correction, version 2.

We do not make any assumptions about the nature of the mechanism that causes the errors in the strings s and thus about the nature of the procedure that tests the variants. Instead, we will just organize the search in such a way that most of the variants are found at the very early stage of its work, and the most of the work that is scarcely result in new variants is done at the later stage.

For a mathematical measure of grouping the variants, we chose the median: the time when half of the total number of variants is generated. For example, let first 10 variants of error correction in s are generated in the first second and then other 10 variants in 5 minutes. Then the total time of work of the algorithm is 301 seconds, average is 15.05 seconds, but the median time is 1 second. This is the behavior that we want, in comparison with an algorithm that produces a variant each 15.05 seconds; the median time of such an algorithm being 150.5 seconds. In fact, the algorithm described in the previous section has the latter type of median time.

In accordance with the data locality principle, we suppose that the most time-consuming operation is fetching a block from the dictionary. Thus, what we want to minimize is the number of blocks fetched until the correct variant is found. For this, we should try to first examine the block in which the most variants are located. Fortunately, with our dictionary structure, most of them are located in the block containing . What is more, this block probably is already fetched into memory at the moment of error correction, because to find that error correction is necessary, first the string s was searched for (and not found) in D.

Only a minor correction to the algorithm described in the previous section is necessary, see Fig. 2. To look in first, we start trying the hypotheses not from the first position p = 1, but instead from the last position in s, and advance in decreasing order. More precisely, since we consider s unlimited, really the first position p to try is the first position that is different in s and the string at the position + 1 in D, as it was discussed in the previous section (there it was the last position to try).

For even better locality, the letters in each position p are tried not in the order from a to z, but from s (p) + 1 to z and then from a to s (p) – 1. This increases a little bit the chances to find the correct variant in the same block where is located.

With this modification of the algorithm described in the previous section, practically without any increase of the number of the hypotheses tried, the median number of the blocks accessed during the search decreases dramatically.

Experimental results. We have implemented the discussed algorithm for morphological analysis of natural language [10, 11] and tested it on a Russian dictionary [9]. Russian is a highly inflective language, with multiple-morpheme word structure and many fusion effects (internal sandhi) discussed above, like prevozmo-ch’ – prevozmog-u – prevozmozh-esh’. Our dictionary consisted of 100 thousand lexemes, which resulted in about 200 thousand dictionary entries (records). In decomposition, we used one general stem list and up to four (for verbs) lists of endings. The main stem list was indexed in three levels of .

A typical result of error correction is as follows: for the mutilated form *prevosmozhesh’, the complete search process took 5 seconds, while the only variant prevozmozhesh’ was found in 0.05 second.

5 Conclusions

Two problems have been discussed: that of finding all the records in the database whose keys are initial substrings of a given unlimited string s, and that of detecting and decomposition of a the initial “word” of the string s into elements of a given sequence of dictionaries .... In the latter case, a stream of letters without boundaries between words, like Japanese text, can be decomposed into words.

We have discussed a variant of alphabetic search and error correction algorithms optimized for data locality requirement. Such a requirement arises when data are accessed in chunks – disk sectors, memory pages, or network packages – and the cost (time) of data access is proportional to the number of chunks accessed rather than number of bytes used. In particular, this is the situation under the widely used operating systems such as Windows.

The structure of dictionary was discussed. At the cost of slight redundancy of storage, it was guaranteed that only one block of dictionary is fetched per query, in spite of the inherent unlocality of the task. As an illustration of usefulness of the proposed structure, algorithms of its building, exporting, searching, and searching with error correction were sketched.

References

1. Aho, Alfred V. Algorithms for finding patterns in strings. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, chapter 5, pages 254-300. Elsevier Science Publishers B. V., 1990.

2. W. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.

3. Bayer R., Unterauer K. Prefix B-Trees. ACM Trans. Database Systems 2., p. 11-26, 1977.

4. Cassidy P. An Investigation of the Semantic Relations in the Roget’s Thesaurus: Preliminary Results. In: A. Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, IPN-UNAM, Mexico, to appear. See also Proc. of CICLing-2000, February 13 to 19, 2000, CIC-IPN, Mexico City, ISBN 970-18-4206-5.

5. Comer, Douglas. The Ubiquitous B-Tree. Computing Surveys 11 (2): 121-137, 1979.

6. Cooper, W.S. The storage problem. Mech. Translat., 1958, p. 74-83.

7. Diccionario de la lengua española. Real academia española, vigésima primera edición, 1992.

8. Fellbaum, Ch. (ed.) WordNet as Electronic Lexical Database. MIT Press, 1998.

9. Gelbukh, A.F. An efficiently implementable model of morphology of an inflective language. Ph.D. thesis, VINITI, Moscow, Russia, 1995.

10. Gel'bukh, A.F. Effective implementation of morphology model for an inflectional natural language. Automatic Documentation and Mathematical Linguistics, Allerton Press, vol. 26, N 1, 1992, pp. 22-31.

11. Gel'bukh, A.F. Minimizing the number of memory accesses in dictionary morphologic analysis. Automatic Documentation and Mathematical Linguistics, Allerton Press, vol. 25, N 3, 1991, pp. 40-45.

12. Gusfield, Dan. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997; ISBN: 0521585198.

13. Johnson, Theodore, and Dennis Shasha. B-Trees with Inserts and Deletes: Why Free-at-Empty Is Better Than Merge-at-Half. JCSS 47(1): 45-76(1993).See other publications at http://www.informatik.uni-trier.de/~ley/db/access/btree.html.

14. Lenat, D. B. and R. V. Guha. Building Large Knowledge Based Systems. Reading, Massachusetts: Addison Wesley, 1990. See also more recent publications on CYC project, http://www.cyc.com.

15. Yuret, Deniz. Discovery of linguistic relations using lexical attraction. Ph.D. thesis, MIT, 1998. See http://xxx.lanl.gov/abs/cmp-lg/9805009.

16. Zaliznyak, A.A. Grammatical dictionary of Russian. Word formation (in Russian). Russkij Jazyk, Moscow, Russia, 1987, 878 pp.

^* Work done under partial support of REDII, CONACyT and SNI, Mexico. An earlier version of this paper was published under the title Approximate Initial Substring Search under Access Locality Requirements in the Proc. of CIC-99, Simposium Internacional de Computación, November 15 - 19, 1999, CIC, IPN, Mexico D.F., pp. 367-381.

^[1] The hyphen at the end of the stem is not a part of the record key, but rather is added in the printed form just for readability.

^[2] The meaning of the Spanish wordforms is not significant for the discussion, so we will not give any glosses.

^[3] We will explain below how we deal with letter alternations like accents.

^[4] The reader can recall that the difference between the classical database engine and the morphological one is that the classical one answers the question “Find the records whose key is equal to the given one” while the morphological one answers the question “Find the records whose key is an initial substring of unknown length of the given one.”

^[5] By alphabetical place of a given string s in the alphabetically ordered list D, we mean the position to which s could be inserted into D, see Definition 1 below.

^[6] If a record crosses the boundary of blocks, we move it to the next block and leave some space in the previous block unused.

^[7] ‘To overcome,’ ‘I will overcome,’ ‘you will overcome.’

^[8] To avoid too complicated notation, we use interchangeably the index of a string in D and the string itself when this does not cause any confusion.

^[9] This can be easily proved by analysis of the definition of the indices .