M. Alexandrov, A. Gelbukh, and P. Makagonov. Evaluation of Thematic Structure of Multidisciplinary Documents. Proc. DEXA-2000, 11th International Conference and Workshop on Database and Expert Systems Applications, NLIS-2000, 2nd International Workshop on Natural Language and Information Systems, Greenwich, England, September 4-8, 2000. IEEE Computer Society Press.

 

Evaluation of Thematic Structure
of Multidisciplinary Documents and Document Flows

 

Mikhail Alexandrov

Alexander Gelbukh

Center for Computing Research, National Polytechnic Institute,
Av. Juan de Dios Batiz, Zacatenco, 07738, DF, Mexico.

{dyner, gelbukh}@pollux.cic.ipn.mx

Pavel Makagonov

 

Moscow Mayor’s Directorate, Moscow City Government,
Novi Arbat 36, 13-th floor, Moscow 121205, Russia.

Ummpp@maria3.munic.msk.su

 

 


Abstract

We consider the classification of documents of complex interdisciplinary character that contain a high level of information noise. The set of classification domains is assumed to be fixed; each domain is defined by a keyword list. The thematic structure of an individual document and of a document flow is discussed. The technology is implemented in the system Document Recognizer, which solves the following tasks: evaluating the contribution of each domain to a document, distributing a document flow among the domains, and selecting a possible leader (the most representative document) in each group.

1.    Introduction

1.1.  Practical tasks

Let us consider two practical examples. The Mayor's Directorate of the Moscow City Government maintains a large database, Sustainable cities of Russia, consisting of tens of thousands of documents with various data on the cities of Russia. This database is used to analyze the quality of city management. In the analysis, the set of topics is fixed, and the most interesting situations are usually those that reflect two or more topics simultaneously.

Another example: the Program Committee of a large conference, such as IFCS’2000 (International Federation of Classification Societies), must distribute the submitted papers among sections. The interdisciplinary character of this conference makes the classification of all 200 submissions very difficult.

In these and similar examples, the documents under consideration often have the following specific features.

First, the documents have a high level of information noise, that is, information that is useless for classification. For example, typical documents in the mentioned database contain historical data, references to other publications, descriptions of various difficulties faced by the city government, etc. The thematic structure of such documents has to be detected based on only 10% to 30% of useful information in the text. Although the papers submitted to the conference are more specialized, the fraction of useful information is reduced because abstracts (very short texts) are considered instead of full papers. Due to the interdisciplinary character of the data, the fraction of useful information in the abstracts is sometimes about 50%.

Second, many such documents are devoted to several themes to an almost equal degree. For instance, a document in the database can simultaneously reflect the problems of pensions for old people, the activity of the police, the responsibility of the municipal authority, etc. At an interdisciplinary conference (and the majority of large conferences are now interdisciplinary), a paper or abstract as a rule considers methods and applications that belong to several domains simultaneously.

Such a situation is quite usual in many document-processing tasks in government, business, or scientific organizations, information agencies, etc.

1.2.  Related work

The present paper deals with the application of a set of dictionaries to document classification and with the visual representation of domain contributions. Dictionary-based document classification algorithms similar to the methods we present here were described in [1]. However, that paper uses a very large predefined concept tree; in contrast, we consider the case of a relatively small set of domains that the users can easily define or change.

There exist effective document classification algorithms that rely on the differences between the frequency properties of words in general texts and in domain-specific texts [2], [6]. In our case, however, no pre-existing knowledge about the general lexicon is used. Also, in the present work it is important that we deal with a set of dictionaries and not with a single dictionary. While [3], [4], [5] discuss mostly the issues of compiling and maintaining dictionaries, we concentrate on their use.

2.    Document metrization

2.1.  Domain dictionary

A set of domain dictionaries is necessary to obtain a numerical representation of the document, which permits the use of traditional methods of numerical analysis for the task of document classification.

We will use the term keyword to refer to any key expression, which can be a single word or a word combination. Moreover, we represent a keyword by a pattern describing a group of words with equivalent meaning. In such a pattern, the inflection for tense, person, gender, number, etc., as well as part-of-speech distinctions, some suffixes, etc., are ignored, e.g.: obligation, obligations, obligatory, oblige → oblig-, where oblig- is the pattern representing all these words. For simplicity, we call such a pattern a keyword.

A domain dictionary (DD) is a dictionary consisting of such keywords (i.e., patterns) supplied with the coefficients of importance for the given domain. The coefficient of importance is a number between 0 and 1 that reflects the fuzzy nature of the relationship between the keywords and the selected domain, i.e., a DD is a fuzzy set of the keywords.

The methodology of creating domain dictionaries includes the analysis of both domain-oriented texts selected by experts and the frequency list of the general lexicon [4], [5]. In practice, the coefficients are determined based on an expert’s or the user’s intuition. The general recommendations for their assignment are as follows: a keyword that is essential for the given domain is assigned a weight > 0.8; an important one, 0.6 to 0.8; a regular one, 0.4 to 0.6; one that is important for this and also for some other domains, 0.2 to 0.4; one that is typical for many domains, < 0.2. If a domain dictionary does not contain these coefficients, they are all considered to be 1. In the simplest case, a user can build DDs using her own system of preferences. Figure 1 shows one of the DDs used in our Computer Center for the classification of business correspondence.

The domains that define the thematic structure of documents are supposed not to be significantly interrelated. In other words, the DDs under consideration have no significant intersection. If two DDs constructed by the user intersect significantly, they should be merged into one combined domain. The intersection is measured as that of fuzzy sets, i.e., based on the coefficients of importance.
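The exact intersection measure is not given here. As a minimal sketch, one could assume the standard fuzzy-set intersection (the minimum of the two coefficients for each shared keyword) and, as a further assumption, relate it to the total weight of the smaller dictionary. The function name `fuzzy_overlap` and the sample dictionaries are hypothetical and serve only as an illustration.

```python
def fuzzy_overlap(dd_a, dd_b):
    """Relative fuzzy intersection of two domain dictionaries.

    Each dictionary maps a keyword pattern to its coefficient of
    importance in [0, 1]. The result lies in [0, 1].
    """
    shared = set(dd_a) & set(dd_b)
    # Fuzzy intersection: membership is the minimum of the two coefficients.
    common = sum(min(dd_a[k], dd_b[k]) for k in shared)
    # Assumed normalization: total weight of the "smaller" dictionary.
    smaller = min(sum(dd_a.values()), sum(dd_b.values()))
    return common / smaller if smaller > 0 else 0.0


# Hypothetical dictionaries that share a single pattern.
finance = {"credit-": 0.9, "bank-": 0.8, "oblig-": 0.5}
law = {"oblig-": 0.7, "court-": 0.9, "contract-": 0.6}
print(round(fuzzy_overlap(finance, law), 3))
```

A value close to 1 would suggest merging the two dictionaries into one combined domain; the threshold for "significant" intersection is left to the user.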

2.2.  Document image

Given a DD, a so-called document image relative to the domain defined by this DD can be built for every document. Such an image is a list of the domain keywords with the corresponding numbers of their occurrences in the document. Given several DDs, several images of a document are built, one for each domain. Figure 2 shows an example of a document image.
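Building a document image can be sketched as follows. This is only an illustration under simplifying assumptions: every pattern is treated as a prefix stem (such as oblig-), word combinations are not handled, and only ASCII letters are recognized as words. The function name `document_image` and the sample data are hypothetical.

```python
import re


def document_image(text, domain_dictionary):
    """Count the occurrences of each dictionary pattern in the text.

    A pattern such as "oblig-" is assumed to match every word that
    begins with the stem "oblig".
    """
    words = re.findall(r"[a-z]+", text.lower())
    image = {}
    for pattern in domain_dictionary:
        stem = pattern.rstrip("-").lower()
        image[pattern] = sum(1 for word in words if word.startswith(stem))
    return image


# Hypothetical domain dictionary: pattern -> coefficient of importance.
sample_dd = {"oblig-": 0.5, "bank-": 0.8, "credit-": 0.9, "tax-": 0.2}
sample_text = ("The bank is obliged to meet its obligations; "
               "the obligation is stated in the credit contract.")
print(document_image(sample_text, sample_dd))
```

Patterns that never occur are kept in the image with a zero count, so that images built from the same DD always have the same length.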

 

Figure 1.  A domain dictionary.

 

 

Figure 2. A document image relative to a specific domain.

 

 


Thus, each document is represented by a set of numerical vectors $(X_{1j}, X_{2j}, \ldots, X_{kj}, \ldots)$, where $k$ is the index of a keyword and $j$ is the index of a domain. Note that such a vector representation does not imply that any of the traditional vector operations can be used, since these lists are not vectors in the strict sense. In particular, the zero vector represents a document that has no relation to the selected theme. Consequently, no binary operation can be applied to this vector and any other one connected with the same theme.

2.3.  Operations with domain dictionaries and document images

The system Document Recognizer, based on the technology being discussed, allows the user to add and remove DDs, to view the contents of the DDs, and to automatically measure the intersections between them. The program also allows the user to add and remove the documents to work with and to view their numerical images; see Figures 1 and 2. With these facilities, the user can quickly formulate and verify hypotheses about the structure of the document-domain space.

3.    Thematic structure of a document

3.1.  Quantitative characteristics

Let us denote by $(X_{1j}, X_{2j}, \ldots, X_{kj}, \ldots)$ the image of some $i$-th document for the $j$-th domain, and by $(A_{1j}, A_{2j}, \ldots, A_{kj}, \ldots)$ the coefficients of importance of the corresponding keywords. A naïve way to calculate the weight of the domain in a given document could be $W_j = \sum_{k=1}^{L_j} A_{kj} X_{kj}$, where $L_j$ is the size of DD$_j$. However, the domains are usually not in equal conditions: their DDs can have different sizes and very different importance coefficients. Thus, it is necessary to correct the weights of texts taking into account the “average power of the dictionary” $P_j = \frac{1}{L_j} \sum_{k=1}^{L_j} A_{kj}$. Thus, the correct weight is $W_j = \frac{1}{P_j} \sum_{k=1}^{L_j} A_{kj} X_{kj}$. This characteristic reflects the total amount of the relevant information in the document.
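The weight calculation can be illustrated with a short sketch. It rests on two assumptions that should be checked against the original formulas: the "average power of the dictionary" is taken to be the mean coefficient of importance, and the corrected weight is the naïve weighted sum divided by it. All function names and sample values are hypothetical.

```python
def naive_weight(image, coefficients):
    """Sum of keyword counts weighted by their coefficients of importance."""
    return sum(a * image.get(pattern, 0) for pattern, a in coefficients.items())


def average_power(coefficients):
    """Assumed reading of the 'average power': the mean coefficient."""
    return sum(coefficients.values()) / len(coefficients)


def corrected_weight(image, coefficients):
    """Naive weight corrected by the average power of the dictionary."""
    return naive_weight(image, coefficients) / average_power(coefficients)


# Hypothetical dictionary (pattern -> coefficient) and document image
# (pattern -> number of occurrences).
coefficients = {"oblig-": 0.5, "bank-": 0.8, "credit-": 0.9, "tax-": 0.2}
image = {"oblig-": 3, "bank-": 1, "credit-": 1, "tax-": 0}
print(naive_weight(image, coefficients))
print(corrected_weight(image, coefficients))
```

The sketch expects a non-empty dictionary with at least one positive coefficient; otherwise the average power is undefined or zero.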

 

Figure 3. Thematic structure of a document
and representativity of various domains.

 

 


Alternatively, if we want to evaluate the correspondence of the document to the domain or to compare several documents, the domain weights are normalized by the document size. For this, we take a 1000-word document as a document of standard size. If the real document contains $M$ words, then the normalizing coefficient is $1000/M$, and the final normalized weight of the $j$-th domain is $W_j \leftarrow (1000/M) \times W_j$. With this, if one concatenates two copies of the same document into a new document, the normalized weight does not change. On the other hand, if one concatenates a document with another document that has the same length and has nothing to do with the given domain, the normalized weight is halved. These examples match the intuitive notion of the share of the document occupied by a given domain.

When discussing the thematic structure of a document, one should take into consideration the relation between the themes reflected in this document. Documents with similar thematic structure must have similar relations between their themes, i.e., these documents must have similar thematic vectors $(W_1, W_2, \ldots, W_j, \ldots)$. To compare the thematic structures of documents, the length of the thematic vectors has to be normalized. Such normalization can be done in several ways, depending on the task.
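The two concatenation examples can be checked with a few lines of code. This is a sketch only; the function name `size_normalized` and the sample numbers are hypothetical, and the unnormalized weight is assumed to be additive under concatenation (as it is for any weighted sum of keyword counts).

```python
def size_normalized(weight, n_words, standard_size=1000):
    """Rescale a domain weight to a document of the standard size."""
    return weight * standard_size / n_words


# A hypothetical 500-word document with unnormalized domain weight 4.0.
single = size_normalized(4.0, 500)

# Two copies of the same document: both the weight and the length double.
doubled = size_normalized(8.0, 1000)

# The document followed by an unrelated 500-word document:
# the weight stays the same while the length doubles.
diluted = size_normalized(4.0, 1000)

print(single, doubled, diluted)
```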

If the user wants to emphasize the most relevant domain for the document, the weights are normalized by the maximal weight: $W'_j = W_j / W_M$, $j = 1, \ldots, N$, where $N$ is the number of DDs and $W_M = \max_j \{W_j\}$. However, this operation has a deficiency: the most relevant domain always has relative weight 1, which creates the illusion that the document is very closely related to this domain.

On the other hand, if the user wants to emphasize the relation between the domains in the document, the weights are normalized by the total weight: $W'_j = W_j / W_\Sigma$, $j = 1, \ldots, N$, where $N$ is the number of DDs and $W_\Sigma = \sum_{j=1}^{N} W_j$. This operation has another deficiency: when more domains are added to the system (i.e., new DDs are attached to the program), the relative weight of every domain decreases. This creates the illusion that the document becomes less and less connected with each of the domains. Conversely, when a domain is eliminated (detached from the program), the relative weights of the other domains increase, which creates the illusion that the removed domain was a source of noise. In Document Recognizer, the first way is used: all thematic vectors are normalized by the weight of the domain with the maximal weight. However, in order to prevent any ambiguity in the definition of domain representativity, Document Recognizer uses additional qualitative characteristics represented visually by colors and their transparency, which are discussed below.
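Both normalizations and the deficiency of the second one can be illustrated with a small sketch. The function names and the sample weights are hypothetical.

```python
def normalize_by_max(weights):
    """Relative weights with respect to the most relevant domain."""
    top = max(weights)
    return [w / top for w in weights] if top > 0 else list(weights)


def normalize_by_sum(weights):
    """Relative weights with respect to the total weight of all domains."""
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else list(weights)


# Hypothetical domain weights of one document for three DDs.
three_domains = [2.0, 6.0, 8.0]
# The same document after one more DD (weight 4.0) is attached.
four_domains = three_domains + [4.0]

print(normalize_by_max(three_domains), normalize_by_max(four_domains))
print(normalize_by_sum(three_domains), normalize_by_sum(four_domains))
```

With normalization by the maximum, the first three relative weights are unaffected by the new dictionary as long as it does not become the dominant one; with normalization by the sum, all of them decrease.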

3.2.  Qualitative characteristics

 

Figure 4. Document flow structured by themes.

 

 


As mentioned above, the first form of normalization (by the maximal weight) emphasizes the main theme of a document, which thus always has weight 1. In essence, this operation removes information noise from the document image. However, at the same time the information about the real contribution of the domains to the document is lost. To indicate the real representativity of various domains in the documents, two characteristics should be taken into account simultaneously: the real density of keywords in the document and the coverage of the dictionary.

In Document Recognizer, an overall estimation of representativity for every domain is calculated based on a combination of the numerical equivalents of these qualitative estimations (see Table 1). It is a value in the interval (–2, 2), with the corresponding qualitative estimations in the interval (Very low, Very high), presented visually as color transparency.

Figure 3 shows the thematic structure of the document EX4.TXT. As one can see, the relative contributions of the five domains are approximately (0.1, 0.2, 0.6, 0.95, 1.0). However, the second dominating theme (domain 4) is represented very weakly, and this theme must be excluded from consideration. The high relative weight of this theme was caused by the repetition of a very limited list of keywords from the corresponding dictionary. This means that some sub-domain of the given domain (or just an unrelated theme), rather than the whole domain, was substantially represented in the document. This visual representation compensates for the information loss caused by normalization.
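A possible way to combine the two estimations is sketched below, using the thresholds of Table 1. The text does not say how the two numerical equivalents are combined; the sketch assumes that they are simply added, which yields values from -2 to 2. The function names and the sample values are hypothetical.

```python
def coverage_score(keywords_found):
    """Numerical equivalent for the number of DD keywords found (Table 1)."""
    if keywords_found > 75:
        return 1    # High
    if keywords_found >= 25:
        return 0    # Mean
    return -1       # Low


def density_score(density_per_mille):
    """Numerical equivalent for the keyword density in per mille (Table 1)."""
    if density_per_mille > 50:
        return 1    # High
    if density_per_mille >= 10:
        return 0    # Mean
    return -1       # Low


def representativity(keywords_found, density_per_mille):
    """Assumed combination: the sum of the two numerical equivalents."""
    return coverage_score(keywords_found) + density_score(density_per_mille)


# A few dictionary keywords repeated very often: high density, low coverage.
print(representativity(keywords_found=10, density_per_mille=60))
```

In this hypothetical case the high density is cancelled by the low coverage, which mirrors the situation where a theme has a high relative weight but is weakly represented.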

4.    Thematic structure of document flow

4.1.  Visual analysis

When the thematic structure of a document set is considered, the first task is to look for documents with similar structures. As mentioned above, the thematic structure of a document is represented by a thematic vector. Consequently, similar documents have similar directions of these vectors in the space of domains. Thus, the corresponding measure of closeness reflects just the angle between two thematic vectors $(W'_{11}, W'_{12}, \ldots, W'_{1N})$ and $(W'_{21}, W'_{22}, \ldots, W'_{2N})$. For this, a correlative measure is used: the distance is taken as the complement of the correlation between the two vectors, where the correlation $R_{12}$ is their normalized scalar product:

$$R_{12} = \frac{\sum_{j=1}^{N} W'_{1j} W'_{2j}}{\sqrt{\sum_{j=1}^{N} (W'_{1j})^2} \, \sqrt{\sum_{j=1}^{N} (W'_{2j})^2}}.$$

Similar vectors have $R_{12} \approx 1$; this means that the distance $D_{12} = 1 - R_{12} \approx 0$, which corresponds to the intuitive notion of distance. Document Recognizer orders the documents by their similarity and presents the results visually as a colored matrix. The thematic structure of the document flow can optionally be determined by the main themes (domains) of the documents. Figure 4 presents a distribution of a document flow among the domains. The color value represents the relative contribution of a specific domain. One can see that the document flow shown in Figure 4 is multidisciplinary. Color matrices are very efficient for a fast intuitive evaluation of the closeness between two or several large document flows. In our practice, experts can very quickly compare several document flows with such matrices.
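The normalized scalar product and the resulting distance can be sketched as follows. The function names are hypothetical, and returning zero correlation for a zero vector is an assumption of this sketch, not something stated in the text.

```python
import math


def correlation(u, v):
    """Normalized scalar product of two thematic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a zero vector is treated as unrelated to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def distance(u, v):
    """Distance between two thematic vectors: 1 minus their correlation."""
    return 1.0 - correlation(u, v)


# Hypothetical thematic vectors over three domains.
doc_a = [1.0, 0.5, 0.0]
doc_b = [0.5, 0.25, 0.0]   # same direction as doc_a
doc_c = [0.0, 0.0, 1.0]    # different main domain
print(distance(doc_a, doc_b), distance(doc_a, doc_c))
```

Because only the direction of the vectors matters, the measure gives the same result for vectors normalized by the maximal weight and by the total weight.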

4.2.  Choice of leader and program options

After the document flow has been subdivided into groups, it is sometimes important to choose a leader in each group. By the leader we mean the most representative document in the group, i.e., the document that is most similar to all documents in the group. Such a leader can represent its group in various situations where only one member of each group should be selected.
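One possible reading of "most similar to all documents in the group" is the document with the smallest total distance to the other members of the group. The sketch below follows this reading; it is an assumption, not necessarily the rule used in Document Recognizer, and the function names and sample vectors are hypothetical.

```python
import math


def distance(u, v):
    """1 minus the normalized scalar product of two thematic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms if norms > 0 else 1.0


def most_similar_to_all(group):
    """Return the name of the document with the smallest total distance.

    `group` maps a document name to its thematic vector.
    """
    def total_distance(name):
        return sum(distance(group[name], other) for other in group.values())

    return min(group, key=total_distance)


# Hypothetical group of three documents over three domains.
group = {
    "a.txt": [1.0, 0.2, 0.0],
    "b.txt": [1.0, 0.4, 0.1],
    "c.txt": [1.0, 0.9, 0.3],
}
print(most_similar_to_all(group))
```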

Table 1. Subjective estimations of domain representativity.

 

Number of keywords from DD occurring in the document | Qualitative estimation | Numerical estimation | Density of keywords in the document (per mille) | Qualitative estimation | Numerical estimation
More than 75 | High | 1 | More than 50 | High | 1
25 to 75 | Mean | 0 | 10 to 50 | Mean | 0
Less than 25 | Low | -1 | Less than 10 | Low | -1

Various criteria for the choice of the leader can be suggested: (1) the maximum number of the main domains, i.e., the domains that have the relative weight 1, (2) the maximum thematic complexity, i.e., the sum of the domain weights, or (3) the maximum relevance to the main domain, taking into account the absolute weight of the domain. For example, the first two criteria can be used for the selection of chairpersons at the conference sessions. In Figure 4, the leaders are marked with a small circle in the table header.
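The first two criteria can be sketched as follows, assuming that the thematic vectors have been normalized by the maximal weight, so that a main domain has relative weight 1. Criterion (3) is omitted because it requires the absolute weights. The function names, the tolerance, and the sample data are hypothetical.

```python
def count_main_domains(vector, tol=1e-9):
    """Criterion (1): the number of domains with relative weight 1."""
    return sum(1 for w in vector if abs(w - 1.0) <= tol)


def thematic_complexity(vector):
    """Criterion (2): the sum of the relative domain weights."""
    return sum(vector)


def leader_by(criterion, group):
    """Return the name of the document that maximizes the given criterion."""
    return max(group, key=lambda name: criterion(group[name]))


# Hypothetical group: document name -> vector normalized by the maximum.
papers = {
    "p1": [1.0, 0.3, 0.1],
    "p2": [1.0, 1.0, 0.2],
    "p3": [0.6, 1.0, 0.9],
}
print(leader_by(count_main_domains, papers))
print(leader_by(thematic_complexity, papers))
```

The two criteria need not agree: in this hypothetical group, one document has two main domains, while another has the largest sum of weights.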

5.    Conclusions and future work

In this paper, the problem of evaluating the thematic structure of a single document and of a set of documents has been considered under the conditions of a high level of information noise and the presence of several domains to an almost equal degree. The problem can be solved on the basis of domain dictionaries containing domain-oriented sets of keywords. Formal quantitative characteristics for the evaluation of the thematic structure of a document have been suggested. However, they have limited capabilities and do not solve the problem completely. Some qualitative characteristics compensating for these limitations have been suggested. These characteristics rely on visual analysis, which is used for the analysis of both a single document and a document flow.

The system Document Recognizer, which implements the technology discussed, has been presented. It is currently used in the Mayor's Directorate of the Moscow City Government. It was used in experiments with abstracts submitted to the 2nd International Conferences APORS’97 and IFCS’2000 [4], [5]. The program is now being tested in the Department of Environment Protection of the Mexico City Government for work with a text archive of ecological data. In our future work, we plan to implement more functions reported as desirable by the current users, in order to turn our system into a convenient workplace for text classification.

6.    References

[1]   Guzman-Arenas, A. (1998). Finding the main themes in a Spanish document. Intern. J. Expert Systems with Applications, v. 14, N 1/2, pp. 139-148.

[2]   Feldman, R., Dagan, I. (1995). Knowledge Discovery in Textual Databases. Proc. of Intern. Symposium “KDD-95” (Montreal, 1995). Montreal, pp. 112-117.

[3]   Lelu, A., Ferhan, S. (1998). Clustering a textual data flow by incremental density-modes seeking. In: Data Science, Classification and Related Methods (Proc. of 6-th Intern. Conf. IFCS, Rome, Italy, 1998). Rome, pp. 206-209.

[4]   Makagonov, P., Alexandrov, M., Sboychakov, K. (1999). Searching in full text Data Bases by using text patterns. In: Pedro Galicia (Ed.), Proc. of Intern. Computer Symposium CIC’99 (Mexico, 1999). National Polytechnic Institute, Mexico, pp. 17-29.

[5]   Makagonov, P., Alexandrov, M., Sboychakov, K. (2000). A toolkit for development of the domain-oriented dictionaries for structuring document flow. Proc. of 7-th Intern. Symposium “IFCS-2000” (Namur, 2000), Springer-Verlag, “Studies in Classification, Data Analysis, and Knowledge Organization” (to be published).

[6]   Gelbukh, A., Sidorov, G., Guzmán-Arenas, A. (1999). A Method of Describing Document Contents through Topic Selection. Proc. SPIRE’99, International Symposium on String Processing and Information Retrieval, Cancun, Mexico, September 22-24. IEEE Computer Society Press, pp. 73-80.