Lazy Query Enrichment: A Method for Indexing Large Specialized Document Bases with Morphology and Concept Hierarchy

Alexander Gelbukh. Lazy Query Enrichment: A Method for Indexing Large Specialized Document Bases with Morphology and Concept Hierarchy. Proc. DEXA-2000, 11^th International Conference and Workshop on Database and Expert Systems Applications, Greenwich, England, September 4-8, 2000. Lecture Notes in Computer Science, Springer.

Lazy Query Enrichment:
A Method for Indexing Large Specialized Document Bases
with Morphology and Concept Hierarchy

Alexander F. Gelbukh

Center for Computing Research (CIC),
National Polytechnic Institute (IPN),
Av. Juan Dios Bátiz s/n esq. Mendizábal,
Col. Zacatenco, C.P. 07738, D.F., Mexico

gelbukh(?)cic.ipn.mx

Abstract. A full-text information retrieval system has to deal with various phenomena of string equivalence: ignore case matching, morphological inflection, derivation, synonymy, and hyponymy or hyperonymy. Technically, this can be handled either at the time of indexing by reducing equivalent strings to a common form or at the time of query processing by enriching the query with the whole set of the equivalent forms. We argue for that the latter way allows for greater flexibility and easier maintenance, while being more affordable than it is usually considered. Our proposal consists in enriching the query only with those forms that really appear in the document base. Our experiments with a thesaurus-based information retrieval system showed only insignificant increase of the query size on average with a 200-megabyte document base, even with highly inflective Spanish language.

1 Introduction^*

Any information retrieval system has to somehow deal with the problem of non-literal matching of the query and the document keywords. For example, given a query computer, the system should be able to retrieve (or not, depending on the user-defined settings and query options) the documents containing the strings Computer, computers, computation, mainframe, motherboard, Internet, etc. There are two places in the system architecture where this problem can be dealt with:

· At the moment of indexing the documents – index enrichment – or

· At the moment of processing of the specific query – query enrichment.

The former technique is most commonly used due to apparently prohibitively serious problems caused by the latter one. We will show, however, that the former method has its own disadvantages, and that the problems of the latter method can be efficiently solved.

Our main motivation was the development of an information retrieval system for the Senate of Mexican Republic. Our customer’s priorities were the following:

· The quality of search; expressive power, and flexibility of the query language.

· Small index size and low load on the server during updating the database contents.

· Minimal changes to the existing maintenance technology and database structure.

· The system was to be operational while the dictionaries and grammars are under development, and the improvements and corrections to them were to be immediately available to the users.

On the other hand, computational efficiency of processing of a single query was of lower priority since the number of the users was rather limited. The characteristic properties of the document base at hand were the following:

· Large size, in the order of a gigabyte, to be extended to several gigabytes,

· Specialized contents with limited variety of lexicon and syntactic constructions,

· Still, unrestricted language with the possibility of occasional use of any word form.

In this work, we are interested in a flexible, computationally efficient, conceptually simple, and easily maintainable solution of the problem of non-literal matching under the requirements and circumstances listed above.

1.1 Related work

There is a vast literature on approximate string matching; a variety of data structures, such as tries, B-trees, etc. were suggested [1, 5, 8]. These works are based on implicit or explicit letter patterns. However, here we consider the problem of matching arbitrary word sets that might not share any simple letter pattern. E.g., the strings dormía, duermo, and durmiendo are forms of the same Spanish verb dormir ‘to sleep;’ the strings church, priest, and pilgrim represent the same concept religion though they do not match any particular letter pattern for approximate string matching to be applied.

The problem of generating and matching the word forms in various languages, including English and Spanish, is well studied in linguistics. Various methods and data structures are suggested in computational linguistics for handling the corresponding dictionaries and morpheme lists [3, 9, 10]. However, in this article we do not discuss the problems of natural language morphology. Instead, we are interested in application of a morphological analyzer to the purely database management task of retrieval of a keyword in all its forms. The list of the word forms is supposed to be already known, while these forms are not supposed to match any particular letter pattern.

The use of concept hierarchies for topical document analysis was suggested in [6] and applied to the information retrieval tasks in [4]. Various large hierarchical thesauri have been compiled [2, 7, 12]. Again, here we are interested not in the compilation or tuning of the thesaurus itself but rather in the way the documents relevant for a specific node can be in practice found in an existing large document base.

2 Types of non-literal string matching

Here we will discuss in more detail the types of the strings that the user might want, or not want, to be matched. An important point in each case is the great degree of flexibility necessary to meet the requirements of a specific user or a specific search.

2.1 Letter case

This is the simplest type of non-literal matching: usually the strings like computer, Computer, COMPUTER, and ComPuTer are to be considered equivalent. On the other case, the user should have an option to turn off such equivalence, e.g., to find Bill or Mainframe (personal names) but not bill nor mainframe, CIS (the name of an organization) but not Cis (personal name).

2.2 Morphology

The second class of strings that frequently are considered equivalent are the word forms of the same lexeme: compute, computing, computed, or its derivational variants: computer, computation, computational, computability.

On the other hand, matching the morphological forms of the same stem is not always desirable. For example, a user can be interested in computers, but not in computation. Very annoying is morphological reduction of ambiguous forms, especially in highly inflective languages such as Spanish. E.g., the Spanish verb comer ‘to eat’ form about 700 morphological variants like comiste ‘you ate’ or comiéndotela ‘you eating it up’, one of which – namely como ‘I eat’ – happens to be homonymous with a very frequently used conjunction como ‘as,’ ‘how.’ Thus, to find the documents with the Spanish lexeme comer with a reasonable precision, one has to sacrifice recall a little bit by forming the query as “all word forms of comer but como.”^[1]

Thus, the user should be able to control the application and the degree of morphological normalization applied to the query by the system.

2.3 Concept hierarchy

The third class of the words that might be considered equivalent are synonyms (processor/CPU), hyponyms (computer/mainframe), hypernyms (computer/device), and possibly other related words.

A dictionary hierarchical thesaurus is used to provide this option to the user. In our system, we use a 33,000-word dictionary organized in a deep hierarchy of related concepts, similar to, and in part derived from, the Roget thesaurus [2]. By related concepts, we mean not only the is-a relationship, but also other words that are of interest to the user looking for the documents on a given topic [6]. For example, the entry for religion contains such words as Bible, priest, pray, church, pilgrim, etc. Thus, the user looking for the documents on religion will be offered a document that mentions Bible. Optionally the degree of such relationship can be weighted quantitatively to measure the relevance of the found document [4].

Obviously, the user should have a full control over the set of words to be considered equivalent to the query keyword(s), since the words that are synonymous for one user might well be very different for another. The following options, or some their combination, are of particular interest:

1. All instances of a given concept, i.e., all words below a given node.

2. Near-immediate instances of a given concept, i.e., the words below a given node but not deeper than n levels.

3. Similar concepts, the words located in the concept tree not farther than n steps from the given one, be the steps in the down, up, or horizontal direction in the tree.

4. General concepts, i.e., the words of which the given node is an instance.

3 Index enrichment

In the previous section, we discussed four cases of identity of the strings: letter case, morphologically inflected forms, synonyms, and a concept tree. A naïve – and the most frequently used – approach to represent the first three cases of identity is index reduction: at the moment of indexing, all letters are reduced, say, to lowercase; all word forms are reduced to the main form (computing, computed, computes, computation, computer ® compute), and synonyms are replaced with one chosen representative (CPU ® processor). The latter case – a concept tree – can be handled by additionally indexing each document with the hypernyms of the words it contains (mainframe ® computer, device, artifact); with this method, a query “devices” will retrieve also mainframe.

In this article, we argue that this naïve approach has serious disadvantages. First of all, as we have shown in the previous section, depending on the desired precision/recall ratio, the user might not want such strings to be considered identical. Thus, indexing process should not cause any loss of information – i.e., all the letter strings should appear in the index as is, without any change, even in the letter case (i.e., reducing to lowercase). To achieve this, the strings reduced in letter case, or morphologically, or by a thesaurus should appear in the index in addition to (rather than instead of) the original strings. We call this process index enrichment.

To allow for certain flexibility of the queries, some improvements to this indexing scheme can be suggested. For instance, the additional keys are to be marked somehow to be distinguishable, if needed, from the original ones. Other possible improvements will be discussed in section 8. However, the index enrichment method presents some inherent problems:

· Lack of flexibility. Only the types of queries for which the index was specifically designed can be executed. The user cannot choose what words of a given set are to be considered equivalent, e.g., “all word forms of compute but computing;” see also the discussion of the example with the Spanish comer in section 2.2, and also the footnote there.

· Lager index. Unlike index reduction, index enrichment can significantly – from twice to tenfold, depending of the use of only morphology or also a thesaurus – increase the index size.

· Maintenance difficulties. Too close coupling of the indexing process and complex lingware – the morphological analyzer and the thesaurus – presents both organizational and technical problems.

- Adding the intelligent search engine to an existing database maintenance technology requires significant changes in the latter, which implies changing existing software and documentation, training the maintenance engineers, etc.

- Unlike stable database maintenance procedures, complex dictionary-based lingware tends to be – at least initially – in constant development. With index reduction or enrichment, each time a change is made to the lingware, the whole database is to be re-indexed, which is often not affordable, especially when the linguistic processing is slow and resource-consuming. On the other hand, delaying re-indexing for a long time greatly discourages any improvements to the lingware, from the point of view of both the developers and the customer.

4 Query enrichment

An alternative to handling non-literal string matching at the stage of indexing is handling it at the stage of query processing. A naïve approach to this method is the following. The letter strings found in the documents are indexed as is, without any changes. Then, at the moment of query processing, the user query is automatically substituted with an appropriate logical expression, e.g., the query “compute and matrix” internally is executed as “(compute OR computes OR computed OR computing) and (matrix OR matrixes OR matrices).” We call this procedure query enrichment; it is also known as query expansion [11].

This method does not present any of the problems listed in the previous section. Namely, it has the following advantages over index enrichment or reduction:

· Flexibility. The user can edit the resulting expression to achieve any desired combination. For example, the query “all forms of the Spanish verb comer but como” can be naturally expressed by the user and processed by the system.

· Easier scoring. Enrichment can be done gradually: first the unmodified query is performed; only if it does not give the necessary results, the query words are substituted with their most close equivalents and the search is repeated; only if this search does not give the necessary results, the query words are substituted with their less close equivalents, etc. With this, the documents that closer corresponds to the query will be found first.

· Smaller index as compared with index enrichment. Only the strings literally present in the document are present in the index.

· Easy maintenance. The indexing procedure is trivial and does not include, nor depends on, any lingware. No changes to the existing non-intelligent indexing technology are necessary when adding an intelligent search engine to an operational database. No re-indexing is necessary when changes are made to the lingware, and such changes are available immediately to the user.

However, the disadvantages of this naïve approach are so obvious that it cannot be considered as a practical alternative. Namely, the following two problems render such a method practically unusable:

· Too large queries. As we have mentioned, the Spanish verb comer form about 700 variants, which results in too large query. With a thesaurus, the concept Europe would contribute to the query all countries, cities, rivers, mountains, nations, types of food, etc. specific for Europe. In addition, each of these strings should be capitalized in all possible ways.

· Generation. It is well known that generating all forms of a given lexeme (compute ® compute, computing, ..., uncomputable, ...) is a task significantly more difficult than guessing the correct main form or stem of a given word form (compute, computing, uncomputable ® compute). In case of a heuristic-based (rather than dictionary-based) morphological algorithm, the number of hypotheses in form generation is usually much greater than in reducing to the stem.

In the next section, we will show how these problems can be worked around.

5 Lazy query enrichment

The improvement we suggest for the method of query enrichment consists in including into the enriched query only the strings known to be present in at least one document of the given database. Since the diversity of language in a specialized document base (such as legal, medicine, etc. texts) is relatively low, only a small fraction of all possible forms of a word or sub-concepts of a concept are present in the database, which greatly reduces the size of the enriched query. At the same time, when applied to the specific database, such a reduced query is equivalent to the fully enriched query. We call this modification of the enrichment procedure lazy enrichment.

The process of lazy query enrichment can be sketched as following.

· A list of all strings that appear at least once in the given database is build.

· This relatively small list is indexed as described in section 3, which produces an index table such as the following:

Letter case		Derivation		Thesaurus
String	Identifier	String	Identifier	String	Identifier
computer	computer	computes	compute	mainframe	computer
Computer	computer	computing	compute	mainframe	device
CompuTer	computer	uncomputability	compute	mainframe	artifact

Here by the identifier (ID) we mean a reduced form, such as reduced to the lowercase, morphologically reduced to the main form, promoted up the tree in the thesaurus, etc., see section 2 (we do not show in this table the improvements discussed in the sections 3 and 8).

· At the stage of processing the query, each keyword of the query is subject to appropriate reduction depending on the user-defined option, for example, to morphological reduction to its main form, e.g., computable ® compute, thus giving a potential ID. In case of ambiguity, all potential IDs are obtained.

· The ID(s) for each query keyword are looked up in the right column of the table, and the word is substituted with the list of the corresponding literal strings found in the left column.

For example, with the table above, the query “what things are computable?” is transformed first into the query “ID = compute” and after the lookup in the table, into the query “computes OR computing OR uncomputability.” Note that it does not contain such strings as compute, computed, nor the very source form computable if they do not occur in the documents in the database.

To process a complex query, such as, for example, a query of the type 3 discussed in the section 2.3, the thesaurus is navigated correspondingly and the query is first built as a disjunction of the relevant lexemes (concepts) and then is enriched through the index table as described above.

The suggested modification of the of the query enrichment method does not have any disadvantages of the latter, thus presenting the following advantages as compared with full query enrichment:

· Smaller queries. Only the words really appearing in the database are included into the query. The difference is especially sensible in the languages with developed morphology. For example, of about 700 forms of the frequently used Spanish verb comer ‘to eat,’ in the database of the Senate of Mexican Republic appeared only 29; of more than 100 forms of falsificar ‘to falsify,’ appeared only 11, etc.

· Only reduction. The algorithm does not use any generation; instead, only reduction is used, such as reduction to lowercase or morphological reduction. This greatly simplifies the lingware, allowing for a rather simple heuristic-based morphological algorithm.

Of course, the suggested method still has some disadvantages as compared with the full query enrichment or index enrichment methods, among them the following ones:

· Need to maintain the list of strings. As compared to the full query enrichment, the suggested method requires to maintain an additional data structure. In the next section we will show that this does not present serious maintenance problems.

· Still increase in query size. As compared to the index enrichment, the queries are still increased in size, though not as much as with full query enrichment.

6 System architecture

Our system is built on top of the existing operational technology treated as a black box. The user query is intercepted, analyzed, and substituted with an enriched query using the lazy query enrichment technique. The new query is presented to the user in a user-friendly form for possible editing. At this stage, gradual enrichment described in the section 4 can be applied.

In our case, the system is based on Informix DataBlade platform. This DBMS allows for specifying with each query a so-called synonym list. It looks much like a synonym dictionary: for each headword, it lists the strings that will be treated by the system as equivalents for each occurrence of the headword in the query. This feature proved to be ideal for the query enrichment technique: instead of enriching the query “computers AND linguistics” as “(computer OR computers OR computational OR ...) AND (linguistics OR language OR linguist OR ...)” we leave the query untouched and provide as a search parameter a synonym list containing the entries such as:

computer: computers, computational, compute, PC, workstation, mainframe, ...

linguistics: language, linguist, linguistic, English, Spanish, HPSG, ...

To maintain the list of strings that occur in the database, the database index can be used directly to update the list each time the database contents updated. Alternatively, if such an internal structure of the existing DBMS is not easily accessible, the documents can be indexed by a separate program.

In our system, for the index updating procedure to be independent from the pre-existing database updating technology, an agent periodically searches the database for the new documents, processes them, and marks as processed. After the list of strings has been updated, the new strings are subject to all necessary types of reduction (morphological forms, synonyms, etc.) as described in Section 5. Finally, if the lingware or dictionaries are updated, the whole list is re-indexed; due to small size of the list, this does not present any serious problem.

7 Experimental results

We investigated a 200 MB subset of the database of the Mexican Senate containing a representative mixture of the discourses of the Senators, laws, and other working documents of the Senate. The corpus contained 21,378,740 letter strings (running words), of them, only 174,386 strings different ones (0.8%). Then we reduced the strings in various ways. Obviously, the ratio of such reduction is equal to the ratio of inflating the query when lazy query enrichment is used.

First, lowercase reduction gave 102,715 different strings, which shows that with only letter case equivalences taken into account, the query is inflated insignificantly – less than twice.

Morphological reduction using a simple postfix removing heuristics showed that the database used 55,489 different stems. Therefore, the average number of strings per stem – i.e., the average ratio of lazy query enrichment using only morphology – was about 4. Here we fully distinguished lowercase and uppercase letters in query enrichment: e.g., the stem cultiv- was represented by three strings: Cultiva, cultiva, and cultivaron.

Thesaurus-based enrichment showed a bit less promising results. Here are two examples. In our dictionary, the concept a Mexican city consists of 2,413 names. The database happened to mention 1,130 such words with case ignoring comparison, or 1,780 strings if uppercase and lowercase letters are distinguished. The concept body parts in our dictionary is represented by 97 words; the database happened to mention 55 of them, or 86 strings with letter case distinguished.

Thus, with thesaurus-based enrichment, the query inflation ratio is high and might be not affordable in a practical system.^[2] However, as we have mentioned, due to the very nature of a thesaurus reflecting “general” knowledge that often proves to be not suitable for a particular user, as well as due to rather low quality of existing dictionaries, this type of enrichment especially needs in the degree of user’s control that cannot be provided by index enrichment. Therefore, we consider the disadvantage of the query enrichment method to be technical, i.e., temporal, while its advantage – a greater degree of control – to be fundamental. Note that as the dictionary is being elaborated, the inflation ratio will not significantly increase since the new words being added to the dictionary are of low frequency in the texts. In the next section, a possible workaround for the problem of too high query inflation will be discussed.

8 Future work: A combined technique

Even with the proposed technique, query enrichment still slows down the system by inflating the user query, especially in case of thesaurus-based enrichment. On the other hand, index enrichment has an advantage of using the context of the word:

· Multiword expressions and idioms in the thesaurus, such as hot dog, can be handled naturally at the stage of indexing of the full text of the document.

· The words can be disambiguated in context; e.g., with the query “wells,” the text oil well will be found while he did it well not.

· The structure of the document can be taken into account, e.g., words in the title or abstract can be indexed differently from the words in the main text.

· The global properties of the document not related with any specific word in it, such as the main topic of the document [4, 6], can be used for indexing.

To provide these features without sacrificing the flexibility of the query language, the index enrichment can be used in combination with the query enrichment. The first step to such combination is the following. Both methods are implemented in the system; in particular, the documents are indexed with index enrichment as explained in section 3. The user queries of standard types such as full morphological reduction or a full thesaurus-based query (see section 2.3), are processed fast with the enriched index without any query enrichment. In the rather rare case when the user somehow modifies the list of strings to be matched, index enrichment is used.

The division of work between the two methods can be optimized in various ways. For example, only deep levels of the thesaurus can be considered for index enrichment (matrix, equation, inequality, ... ® @mathematics; here we mark the introduced hypernyms with @, as mentioned in Section 3), while the upper level hierarchy, if need by the query, can be taken into account by query enrichment (science ® @mathematics OR @physics OR @chemistry). What is more, the way the users most frequently modify their queries can be automatically, semi-automatically, or manually learned and implemented in the index enrichment. For instance, if the users frequently exclude the form como from the paradigm of the Spanish verb comer (see section 2.2), then it should be excluded at the stage of index enrichment. The query with the unmodified paradigm will be internally (and transparently for the user) implemented, if needed, through query enrichment as “=comer OR como” (supposing that morphologically reduced forms are marked with =).

The combined technique compensates for the query inflation problem caused by query enrichment, especially of the thesaurus-based type. It has the advantage of higher performance due to smaller queries, without sacrificing flexibility. Obviously, it implies both advantages and disadvantages of index enrichment. Specifically, it gives the advantage of the possibility to consider the context. On the other hand, it introduces the methodological and technical disadvantages of index enrichment listed in the section 3, such as maintenance problems and undesirable coupling of the DBMS and the lingware.

9 Conclusions

We have shown that the traditional technique for non-literal string matching in information retrieval – index enrichment – has some inherent disadvantages, and have suggested a new technique – lazy query enrichment – that allows greater flexibility of queries, better overall system architecture, and easier maintenance.

Our method still has two problems: (1) the enriched queries in some cases are significantly larger and thus work slower, and (2) it is not obvious how to take the context of the keyword into account. As a solution, a combination of the query and index enrichment methods was discussed.

10 References

1. Aho, Alfred V. Algorithms for finding patterns in strings. In J. van Leeuwen (ed.), Handbook of Theoretical Computer Science, chapter 5, pp. 254-300. Elsevier Science Publishers B. V., 1990.

2. Cassidy P. An Investigation of the Semantic Relations in the Roget’s Thesaurus: Preliminary Results. In: A. Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, IPN-UNAM, Mexico, to appear. See also Proc. of CICLing-2000, February 2000, CIC-IPN, Mexico City, ISBN 970-18-4206-5.

3. Gelbukh, A. A data structure for prefix search under access locality requirements and its application to spelling correction. Proc. of MICAI-2000: Mexican International Conference on Artificial Intelligence, Acapulco, Mexico, 2000.

4. Gelbukh, A., G. Sidorov, and A. Guzmán-Arenas. Use of a Weighted Topic Hierarchy for Document Classification, Matoušek et al., TSD-99: Text, Speech, Dialogue. Lecture Notes in Artificial Intelligence N 1692, Springer, 1999.

5. Gusfield, Dan. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997; ISBN: 0521585198.

6. Guzmán-Arenas, Adolfo. Finding the main themes in a Spanish document, Journal Expert Systems with Applications, Vol. 14, No. 1/2. Jan/Feb 1998, pp. 139-148.

7. Fellbaum, Ch. (ed.) WordNet as Electronic Lexical Database. MIT Press, 1998.

8. Frakes, W., and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.

9. Hausser, Ronald. Three principled methods of automatic word form recognition. Proc. of VEXTAL: Venecia per il Tratamento Automatico delle Lingue. Venice, Italy, Sept. 1999.

10. Koskenniemi, Kimmo. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. University of Helsinki Publications, N 11, 1983.

11. Kowalski, Gerald. Information Retrieval Systems Theory and Implementation, Kluwer Academic Publishers, 1997.

12. Lenat, D. B. and R. V. Guha. Building Large Knowledge Based Systems. Reading, Massachusetts: Addison Wesley, 1990. See also more recent publications on CYC project, http://www.cyc.com.

^* Work done under partial support of the Senate of Mexican Republic, CONACyT (Mexico), and CGEPI-IPN (Mexico).

^[1] Note that this effect cannot be achieved with a simple logical expression “all documents containing the word forms of comer but those containing the word como” since its meaning is not equivalent to the desired one. Namely, the recall with such a query would be extremely low since nearly any Spanish document does contain the string como ‘as,’ ‘how.’

^[2] In our system, the application of query enrichment is based on the Informix DataBlade’s synonym list feature, as described in the section 6. Our tests showed that this feature works with the entry size up to 3,000 synonyms for one headword. Thus, with this particular platform, thousand-fold enrichment is still possible.