Measures for Determining Thematic Structure of Documents with Domain Dictionaries

Abstract

The paper discusses investigation of the thematic structure of individual documents, and of the relationships and grouping within a document collection. The use of domain-oriented dictionaries is considered. The appropriate characteristics of the documents and their visual graphical representation are discussed, as well as the metrics that serve best as similarity measures for investigation of the structure of document collection. The results are implemented in our system Text Recognizer used in the Mayor Department of Moscow Government.

1 Introduction*

Let us consider three very common situations. A government office receives tens of thousands of appeals from the citizens. A scientist or student reviews a large collection of articles downloaded from the Internet. The conference organizing committee is preparing the program on the base of hundreds of submitted articles.

In any such situation, the following two operations are to be performed: (1) define the contribution of themes to each document, (2) define the groups of documents that have similar thematic structure.

The data has the following specifics: (1) the set of domains of interest is relatively small and fixed, (2) the documents are noisy, with only 10% to 30% of the text being useful information, and (3) some documents contain a mixture of the domains in nearly equal proportion.

The first task improves distribution of the documents by, say, departments of the institution. The second task allows for extraction of some hidden information from the set of the documents, which is text mining proper. In this article, we describe our experience with our Text Recognizer system currently being in use in the Mayor Department of the Moscow Government.

While [Feldman et al., 1995] discusses the use of large structured dictionaries, we use very small weighted word lists. [Andriaans et al., 1996] discussed visualization of document relationships, but in our case the nature of relationships is different. Document grouping based on similarity is discussed in [Hall, 1995], though we use visual approach.

Our group has reported some developments in dictionary compilation and document classification in [Guzmán-Arenas, 1998; Gelbukh et al., 1998; Alexandrov et al., 1999; Makagonov, 1997]. In this article, however, we concentrate on the visual methods and the use of simple, easily compilable dictionaries.

2 Document image

A domain dictionary is a dictionary containing key expressions (words or word combinations) and their appropriate coefficients of importance (weights) for the given domain. Hereafter, we will use a more familiar term keyword to refer to any key expression, which can be one word or word combination.

In our system, the words having a common meaning and a common root, like obligation, oblige, obligatory, are handling as one dictionary entry.

A document image related to a given domain (dictionary) is a list of the keywords found in the document, with their respective numbers of occurrences.

3 Relations within a document

Let us consider several domains. The weight of the domain j in a given document is , where X_kj is the frequency of the keyword k in the document and A_k is the importance coefficient of the keyword for the given domain.

Depending on the task, the weights can be optionally normalized by the size of the document: W_j= , where N is the number of words in the document.

Additionally, depending on the task, the weights can be normalized within the set of domains: , where W = {W_j}, and can be estimated in different ways: (1) If we need to emphasize the most relevant domain, then ; (2a) If we need to emphasize the relation between the domains within the document, then (linear length) or (2b) (quadratic length). We do not discuss here the difference between the two latter cases. Note that such relative weights reflect the relations of domains within the document rather than the contribution of individual domains.

However, the weights W_j do not represent all the necessary information about the relationship between a document and domain. While it measures the “quantity” of such relationship, we introduce another measure to represent its “quality:” the representativity of a domain for the text. It directly depends on (1) the share of the domain dictionary in the document, and (2) the density of keywords in the document.

Thus, a document is characterized by two measures. In visual representation, we found it convenient to represent the “quantity” W_j as the height of a bar on a histogram, while the representativity as its color: the less quality, the paler color is used.

4 Relations between documents

To measure the similarity, or distance, between two documents, D₁ and D₂, we use one of the following measures D₁₂: (1) If we want small differences in many coordinates to affect our measure, then we use a linear metric D₁₂ = ; (2) If we do not want many small differences to affect the measure, then we use a quadratic metric D₁₂ = . Additionally, if we want small differences between documents to be ignored and large differences emphasized, we us a square power of these measures: .

The options discussed above (normalization by the document size, by domains, choice of metric, and squaring) result in 24 variants of distance calculation scheme, each used for its specific combination of goals. For instance, we found that the most commonly used is the following metric: weights normalized by document size and by domain with quadratic length, quadratic metric, squaring. This is equivalent to a so-called special correlative measure (mathematically it is not a metric): 1 – .

These distance measures allow for automatic and supervised clusterization of a document collection or the set of domain dictionaries. Especially useful for supervised clusterization proved to be a combination of automatic initial clusterization and visual representation of the matrix W.

Acknowledgments

The Text Recognizer system is a part of the product family Text Investigator developed under direction of Dr. P. Makagonov, Moscow Mayor Department, Russia.

References

[Guzmán-Arenas, 1998] A. Guzmán-Arenas. Finding the main themes in a Spanish document. Intern. J. Expert Systems with Applications, 14(1/2):139-148, 1998.

[Gelbukh et al., 1998] A. Gelbukh, G. Sidorov, A. Guzmán-Arenas. A system of document search and classification with the use of a hierarchical topic dictionary (in Russian). In Proc. VIII-th International Conference Knowledge-Dialogue-Solution (KDS-99), Yalta, Ukraine, September 1999 (to appear).

[Feldman et al., 1995] R. Feldman, I. Dagan. Knowledge Discovery in Textual Databases (KDT), In Proc. of Intern. Symposium “KDD-95”, pages 112-117, Montreal, Canada, 1995.

[Alexandrov et al., 1999] M. Alexandrov, P.Makagonov, K. Sboychakov. Searching similar texts: some approaches to solutions. In Acta Academia – 1999, pages 215-223, International Informatization Academy, in consultative status with United Nation Org., Branch of Moldova. Chisinau, Evrika, 1999.

[Andriaans et al., 1996] P. Andriaans, D. Zantinge. Data Mining, Addison-Wessley, 159 pages, 1996.

[Hall, 1995] Curt Hall. The devil’s in the details: techniques, tools, and applications for database mining and knowledge discovery – part 1. In Intern. J. Intelligence software strategies, XI(9):1-16, 1995.

[Makagonov, 1997] P. Makagonov, Computer tool kit for structuring and classifying the information for solution of social, political, economic and regional problems in the Moscow Mayor's Office. In Proc. of Intern. Symposium CIC-97, pages 52-85, IPN, Mexico, 1997.

* Work done under partial support of CONACyT, REDII-CONACyT, and SNI, Mexico.