"POLIBITS"

Research journal
on Computer science and computer engineering with applications

Issue 39 (January-June 2009)

Scanned cover pages
Editorial (p. 3), Alexander Gelbukh

SPECIAL ISSUE:

NATURAL LANGUAGE PROCESSING AND KNOWLEDGE MANAGEMENT

Guest Editor: Alexander Gelbukh

1. Axel-Cyrille Ngonga Ngomo and Frank Schumacher (Germany)

Disentangling the Wikipedia Category Graph for Corpus Extraction (pp. 5-10)

In several areas of research such as knowledge management and natural language processing, domain-specific corpora are required for tasks such as terminology extraction and ontology learning. The investigations presented herein are based on the assumption that Wikipedia can be used for the purpose of corpus extraction. Wikipedia has the advantage of possessing a semantic layer, which should ease the extraction of domain-specific corpora. Yet, as the Wikipedia category graph is scale-free, it cannot be used as it is for these purposes. In this paper, we propose a novel approach to graph clustering called BorderFlow, which we use and evaluate on the Wikipedia category graph. Additional possible applications of these results in the area of information retrieval are presented.

 

2. Ronald Winnemöller (Germany)

Semantic Enterprise Search (but no Web 2.0) (pp. 11-17)

In this paper, we propose semantic enterprise search as a promising technical methodology for improving accessibility to institutional knowledge. We briefly discuss the nature of knowledge and ignorance with respect to web-based information retrieval before introducing our particular view on semantic search as a tight fusion of search engine and semantic web technologies, based on semantic annotations and the concept of intra-institutionwise distributed extensibility while still maintaining free keyword search functionality. Consequently, our architecture implementation makes strong use of the Aperture and Lucene software frameworks but introduces the novel concept of "RDF documents". Because our prototype system is not yet complete, we are not able to provide performance statistics; instead, we present a concise example scenario.

 

3. Sergey Yablonsky (Russia)

Semantic Web Framework for Development of Very Large Ontologies (pp. 19-26)

This paper deals with the development of a Semantic Web framework for very large ontologies. The Semantic Web is often associated with specific XML-based standards for semantics, such as RDF and OWL. Applying lexical ontologies such as WordNet to different tasks on the Semantic Web requires their representation in RDF and/or OWL formats, with the possibility of different ontology mappings, semantic workflows, services and other semantic technologies.

 

4. Saïd Radhouani, Claire-Lise Mottaz Jiang, and Gilles Falquet (Switzerland)

FlexIR: a Domain-Specific Information Retrieval System (pp. 27-31)

We present a precise search engine adapted to professional environments which are characterized by a domain (e.g. medicine, law, sport, and so on). In our approach, each domain has its own terminology (i.e. a set of terms that denote its concepts: team, player, etc.) and it is organized along dimensions, such as person, location, etc. The dimensions, as described below, are made of concepts and semantic relationships that represent a particular perspective or point of view on the domain. We mainly use the notion of domain dimension to: i) precisely index document content, and ii) develop an interactive interface which allows the user to precisely describe his or her information need and therefore precisely access the document collection.

 

5. Jianshu Sun, Chong Long, Xiaoyan Zhu, and Minlie Huang (China)

Mining Reviews for Product Comparison and Recommendation (pp. 33-40)

Recently, as the number of customer reviews on product service websites has grown rapidly, customers spend much time selecting and comparing their favorite products. Researchers have been aware of this problem, and many studies have investigated mining opinions from online reviews. Unfortunately, few previous works give comparisons or recommendations among the products. In this paper, we propose an automated system to address this problem. We first build a product feature sentiment database from the reviews. Then we compare various products from both subjective and objective perspectives at the feature level. Finally, product recommendations can be suggested according to the previous comparisons and an evolution tree constructed from the reviews. Experimental results demonstrate the effectiveness of the proposed approach in mining digital camera reviews. A demo system has now been put into practical use.

 

6. Cerstin Mahlow and Michael Piotrowski (Switzerland)

SMM: Detailed, Structured Morphological Analysis for Spanish (pp. 41-48)

We present a morphological analyzer for Spanish called SMM. SMM is implemented in the grammar development framework Malaga, which is based on the formalism of Left-Associative Grammar. We briefly present the Malaga framework, describe the implementation decisions for some interesting morphological phenomena of Spanish, and report on the evaluation results from the analysis of corpora. SMM was originally only designed for analyzing word forms; in this article we outline two approaches for using SMM and the facilities provided by Malaga to also generate verbal paradigms. SMM can also be embedded into applications by making use of the Malaga programming interface; we briefly discuss some application scenarios.

 

7. Claudiu Mihăilă, Corina Forăscu, and Sabin C. Buraga (Romania)

CLAU – A Service-Oriented System for Complex Language Alignment: Architectural Aspects (pp. 49-54)

In recent years, parallel corpora have become an effective framework to study how well linguistic phenomena and, more specifically, annotation schemata can be applied when importing the annotations from one language to the other(s). In the case of automatic import, evaluation and correction are best performed by linguists using specific software. The paper proposes CLAU – a service-oriented interactive application allowing users to import, evaluate, correct, and share XML-based annotations in parallel texts. The design, general architecture, and implementation are discussed. Also, two use cases are presented: temporal annotations in parallel texts and how CLAU facilitates social Web interactions between language scientists.

 

8. Kamlesh Dutta, Nupur Prakash, and Saroj Kaushik (India)

Application of Pronominal Divergence and Anaphora Resolution in English-Hindi Machine Translation (pp. 55-58)

So far, the majority of Machine Translation (MT) research has focused on translation at the level of individual sentences. For sentence-level translation, Machine Translation has addressed various divergence issues for a large variety of languages; the issue of pronominal divergence has been presented only recently. Since the quality of translation as required by users follows coherent multi-sentence discourse structure in a specific context, pronominal divergence helps us in understanding the nuances of translation arising out of disparity in the languages. Subsequently, using clues from this divergence, the anaphora resolution system can find the correct interpretation for the given pronominal referents and other entities by resolving the inter-sentential context. In the literature, researchers have examined the issue and have proposed ways for the classification and resolution of anaphora. However, for Indic languages, not many studies are available. In this paper, we discuss different aspects of pronominal divergence that affect anaphora resolution in English-Hindi Machine Translation (EHMT). The study shall be helpful in developing approaches that can explicitly use inter-sentential information in order to resolve specific types of ambiguity and that can generate coherent multi-sentence discourse structure in the target language to produce higher-quality Machine Translation.

 

9. Hye-Jin Jeong and Yong-Sung Kim (South Korea)

E-Learning Content Design and Implementation based on Learners’ Levels (pp. 59-63)

Modern techniques of content design should not depend on restrictions of schedules and physical spaces. Still, learning that depends on content provided from a server is difficult to implement effectively without taking into consideration learners’ levels. The learning should fit the learners’ abilities. In this study, we propose methods of developing learning content that fits individual levels. Evaluations for individual levels are presented as the first level and the second level. The first level presents “evaluation learning” for each paragraph of the learning, while at the second level evaluations are carried out through “Trying the following” and “Trying oneself”. “Checking Test” as part of the “sum of learning” is carried out during the first evaluation. Also “Trying oneself” is carried out as commensurate learning according to learners’ levels.

 
10. Pilar Manchón, Carmen del Solar, Gabriel Amores, and Guillermo Pérez (Spain)

Modeling Multimodal Multitasking in a Smart House (pp. 65-71)

This paper belongs to an ongoing series of papers presented at different conferences illustrating the results obtained from the analysis of the MIMUS corpus. This corpus is the result of a number of WoZ experiments conducted at the University of Seville as part of the TALK Project. The main objective of the MIMUS corpus was to gather information about different users and their performance, preferences and usage of a multimodal multilingual natural dialogue system in the Smart Home scenario. The focus group is composed of wheelchair-bound users. In previous papers, the corpus and all relevant information related to it have been analyzed in depth. In this paper, we focus on multimodal multitasking during the experiments, that is, modeling how users may perform more than one task in parallel. These results may help us envision the importance of discriminating complementary vs. independent simultaneous events in multimodal systems. This gains more relevance when we take into account the likelihood of the co-occurrence of these events, and the fact that humans tend to multitask when they are sufficiently comfortable with the tools they are handling.