
Igor Bolshakov, Alexander Gelbukh, and Sofia Galicia-Haro. A Simple Method to Detect and Correct Spanish Accentuation Typos. Proc. PACLING-99, Pacific Association for Computational Linguistics, University of Waterloo, Waterloo, Ontario, Canada, August 25-28, 1999, ISBN 0-9685753-0-7, pp. 104-113.

A Simple Method to Detect and Correct
Spanish Accentuation Typos

I. A. Bolshakov, A. F. Gelbukh, and Sofía N. Galicia-Haro

Natural Language Laboratory, Center for Computing Research (CIC), National Polytechnic Institute (IPN),
Av. Juan de Dios Bátiz, esq. Mendizabal, Zacatenco, C.P. 07738, Mexico D.F., Mexico.
{igor, gelbukh, sofia}@pollux.cic.ipn.mx

We study a frequent typographic error in Spanish: the omission of a stress mark. It transforms one existing word into another existing one and therefore cannot be detected by usual spell-checkers. A simple recovery procedure is proposed. It relies on checking whether a specific 4-word context conforms to a noun or adjective, and on a closed list of pairs of words that differ only in the accentuation mark, one of them being a verb and the other a noun or adjective of certain gender and number. When one of the words of such a pair is found in the text, the algorithm checks its context to detect whether it corresponds to a noun or adjective; if the condition is not satisfied, the alternative hypothesis of a verb is supposed. The idea is applicable to numerous nouns or adjectives like número or máquina that turn into quasi-homonymous personal verb forms when they lose their stress marks. An exhaustive list of nearly 300 quasi-homonyms is given.

Key words: natural language processing, spell-checking, Spanish, accentuation.

Introduction

The usual spell-checkers consider each word out of its context, i.e., independently of its environment. With such a strategy, only those typographic and orthographic errors can be detected that change an existing word to a senseless string of letters that does not exist in the given language. As to the errors converting one existing word to another existing one, the changed words go unnoticed and make the text essentially ungrammatical. For their reliable correction, advanced grammar checkers with a full-scale syntactic parser are to be used, which is not always affordable. To solve this problem, different heuristic methods were proposed, among others phonological, morphological, syntactic, statistical, and knowledge-based ones.

The purpose of these methods is principally to formulate, based on the context, some conditions for the word in question and then to check whether the word satisfies them. In the way the context is analyzed, this task is similar to lexical disambiguation.

One of the best-known disambiguation tasks is context-based part of speech tagging. It is useful for spell checking, especially for wide coverage spelling error detection and correction. For example, Elmi and Evens (1998) used a combination of dictionary-based methods and context part of speech analysis for designing a tutorial system to help medical students to learn the language used in cardiovascular physiology. A part of speech tagger was used to disambiguate contexts with adjacent part of speech sequences.

There are different methods for part of speech tagging, mainly linguistic, statistical, and machine learning ones. There are also hybrid methods that combine different approaches; for example, Tzoukerman et al. (1994) have used statistical and knowledge based resources. Nevertheless, though methods of full part of speech tagging are well known, they are too complicated and resource-demanding for the applications in spell checking.

In some works on wide coverage spelling checking, a combination of different methods was used to increase the detection and correction accuracy. Golding and Schabes (1996) used a hybrid method based on part of speech trigrams and context features. Agirre et al. (1998) proposed a multi-resource method based on syntagmatic, paradigmatic, and statistical knowledge for context sensitive spelling correction, which proves to work better as wider context is taken into account.

In this paper, we suggest a method applicable to a specific kind of errors. In a real spell-checker, it is to be used in conjunction with or in addition to well-known methods for dealing with other types of errors, which we here neither consider nor discuss. The proposed method is a combination of the method of confusion lists with an extremely simplified context-based part of speech tagging. Namely, we suggest some heuristics to solve the spell-checking problem for a specific group of Spanish words.

In Spanish, some errors fully undetectable out of context are connected with accentuation rules. For example, the phrases este artículo tiene… ‘this article has…’, el número del adjetivo… ‘the number of the adjective…’, or las páginas siguientes… ‘the following pages…’ would be considered correct by a spell-checker even if the accented words artículo, número, and páginas lost their stress marks: *este articulo tiene… ‘this I join has…’, *el numero del adjetivo… ‘the I enumerate of the adjective…’, *las paginas siguientes… ‘the you paginate following…’. Indeed, the words articulo, numero, and paginas are correct Spanish verb forms, though fully unacceptable in the mentioned contexts.

This type of error is quite usual for foreigners and is very common in informal Spanish writing, especially on the Internet. Thus, it appears in large text corpora collected from the Internet. One of the authors, who is Russian, made more than 60 accentuation errors in his first paper in Spanish, and only half of them were detected by the spell-checker of Word for Windows, version 6. A newer version of Word for Windows uses a grammar checker to detect more errors, but it nevertheless missed most of the mentioned errors too.

Many other languages also employ diacritic marks extensively, among them Polish, Czech, Hungarian, and French. In these languages, the omission of a diacritic mark usually converts a valid word to a meaningless string, as it does in many cases in Spanish itself, e.g., with the words énfasis, ambigüedad, and niño after replacing é, ü, and ñ with e, u, and n, respectively. For Polish, Czech, and Hungarian, such cases prevail and thus can be corrected by a usual spell-checker. Meanwhile, in French (cf. noun pose versus participle posé) and especially in Spanish there are numerous groups of words that are valid both with and without specific diacritic marks. Here we are interested not in all possible diacritics, but only in accentuation marks and, in this article, only in Spanish. With some other types of diacritics or in some languages other than Spanish, well-known methods of spelling correction are applicable. In some other cases, especially with diacritics other than accent marks (e.g., Spanish soñar vs. sonar), the methods described here are not applicable.

Yarowsky (1994) proposed a method for restoration of accent marks in arbitrary French or Spanish words, based on statistically proved contexts. This method requires very large and strictly well-formed learning corpora. Learning gives vast context collections for various word forms. Even after compressing these collections by means of a morphological analyzer and a semantic generalizer supported by a dictionary, the tool used for accent mark adjusting remains somewhat cumbersome.

To our knowledge, in another accent restoration program Daciuk (1998) used a similar statistical method, while Gutting et al. (1992) used hidden Markov chains. All these programs require large tables and dictionaries. Meanwhile, the methods used by commercial products like the Spanish version of WordPerfect or the French Le Correcteur of Les Logiciels Machina Sapiens were not published.

Our method uses a very restricted set of word endings and a small closed list of relevant words that can be processed properly. At the same time, our method is applicable only to pairs opposed in their parts of speech, namely, noun or adjective versus verb. Hence, it does not claim universality: confusions within verb pairs like Spanish presento vs. presentó or hable vs. hablé cannot be resolved by it in a regular way.

1. Linguistic information used by the algorithm

To repeat, Spanish accent marks are used in such a way that they often distinguish between words of different parts of speech. Both the nouns and adjectives on the one hand and the usual personal verb forms on the other hand have the endings ‑o, ‑a, ‑as, ‑e, ‑es.

Our method employs a kind of part-of-speech tagger that can set only one mark, “possible adjective or noun,” leaving all the other words unmarked. Part-of-speech taggers usually take into account the linear context to disambiguate part-of-speech homonymy (Yarowsky 1994). Fortunately, Spanish presents good conditions for detecting the noun or adjective context: articles are used in Spanish much more frequently than in English, and there is grammatical agreement between nouns, adjectives, and articles.

The main group of error-prone Spanish words with stress marks contains those nouns or adjectives that turn into a personal verb form if the stress mark is omitted. At the first stage of our study, the quasi-homonymous pairs “noun or adjective versus verb” were collected through manual search in large Spanish dictionaries, such as those of the Academy of Madrid (Diccionario de la lengua española 1996) and the Anaya Group (Diccionario del español contemporáneo 1997). This gave us about 150 pairs and stimulated further search. Then a program was written for automatic extraction of such nouns and adjectives from large electronic dictionaries. Thus, the total number of pairs reached nearly 300. The accented counterparts of all the gathered pairs are given in the Appendix. The common features of the pairs are the following:

· The pairs are Spanish words of middle statistical rank, and many of them are quite common in scientific and technical literature. Note that the noun in each pair is usually more frequent than the corresponding verb.

· An accentuated counterpart is a specific word form of a noun (like cómputo), of an adjective (like legítimas) or of a vocable combining two homonyms, a noun and an adjective (like crítica ‘the criticism’ or feminine ‘critical’).

· A non-accentuated counterpart is a specific personal form of a verb in singular. Therefore, the components of each pair correlate not as a lexeme to lexeme, but as a word form to another word form. Hence, all the corresponding pairs should be considered separately.

· Each accentuated form, independently of its part of speech (noun, adjective or homonymous vocable), is characterized by a specific combination of number and gender, e.g., ánimo (sing, masc), específica (sing, fem), cópulas (plur, fem), partícipes (plur, masc). This combination is fixed for each pair and can be assigned to it beforehand.

Thus, the entire information about a pair may be represented in the computer as a triple (accentuated counterpart, number, gender). The non-accentuated counterpart can be easily computed from its accentuated counterpart through simple replacement of the accentuated letters with their non-accentuated analogs. The structure of the list is described in more detail in the Appendix.
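The triple representation and the computation of the non-accentuated counterpart can be illustrated with a minimal Python sketch. The names Entry, strip_stress, and build_index and the three sample entries are our illustrative assumptions, not part of the original Pascal program. Only acute accents are removed, since ñ and ü are outside the scope of the method.

```python
from typing import Dict, Iterable, NamedTuple

# Only stress marks (acute accents) are mapped; ñ and ü are deliberately kept.
_STRESS_MAP = str.maketrans("áéíóúÁÉÍÓÚ", "aeiouAEIOU")


def strip_stress(word: str) -> str:
    """Replace stressed vowels with their non-accentuated analogs."""
    return word.translate(_STRESS_MAP)


class Entry(NamedTuple):
    """One list entry: (accentuated counterpart, number, gender)."""
    accented: str  # noun or adjective form, e.g. "número"
    number: str    # "sing" or "plur"
    gender: str    # "masc", "fem", or "both"


def build_index(entries: Iterable[Entry]) -> Dict[str, Entry]:
    """Map both the accentuated form and its computed plain form to the entry."""
    index: Dict[str, Entry] = {}
    for entry in entries:
        index[entry.accented] = entry
        index[strip_stress(entry.accented)] = entry
    return index


# A hypothetical three-entry sample of the list.
SAMPLE = [
    Entry("número", "sing", "masc"),
    Entry("páginas", "plur", "fem"),
    Entry("célebre", "sing", "both"),
]
INDEX = build_index(SAMPLE)
```

With such an index, a single dictionary lookup tells whether a scanned word is either member of a pair and gives the gender and number of the noun or adjective member.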

2. The main algorithm

The main algorithm scans the text, word by word, and uses a technique known in some style checkers (Ashmanov 1995): instead of checking the correctness of the current situation in the text, a hypothesis is formed and then checked about a possible error in the current place. If the hypothesis looks reasonable, a possible error is reported to the user.

In our case, either a verb or a non-verb can be used in a specific context, while the contexts equally suitable for both cases are rather rare. When a doubtful verbal context is found, a hypothetical noun or adjective is supposed. If the context is good for it, then this context should be bad for the original verb.

While scanning, each word is matched against two forms of each list entry: the accentuated one and its automatically computed non-accentuated counterpart. The characteristics of the hypothetical noun or adjective, namely, its gender and number, are retrieved from the same list.

Let the word under consideration be w₀, its immediate linear context be the sequence w₋₁, w₀, w₁, w₂, and the variable ŵ be the hypothetical noun or adjective. Then the work of the algorithm depends on the form in which the word was found.

· If the word was found in the accentuated form, it is considered to be a noun or adjective and the suitability of the immediate context is checked as described in the next section; the variable ŵ is set to w₀.

· If the word was found in the non-accentuated form, it is considered to be a verb. Since we cannot check the context for a verb, a hypothesis is considered that the true intended word was the corresponding accentuated counterpart of the non-accentuated word w₀, i.e., a noun or adjective. The variable ŵ is set to this accentuated counterpart, and the context is checked for this hypothetical word. If the context is suitable for it, the hypothesis is accepted and a possible error is reported.

When a possible error is reported, the user is asked, in an interactive manner, whether the doubtful word w₀ should be replaced with its counterpart.
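The scanning loop described in this section might be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' program: the two-entry PAIRS table is hypothetical, the context check is passed in as a function (context_fits) because it is defined only in the next section, and the handling of an accentuated form in an unsuitable context follows the statement in the abstract that the verb hypothesis is then supposed.

```python
from typing import Callable, Dict, List, Optional, Tuple

STRESS = str.maketrans("áéíóú", "aeiou")

# Hypothetical miniature pair list: accentuated form -> (gender, number).
PAIRS: Dict[str, Tuple[str, str]] = {
    "número": ("masc", "sing"),
    "páginas": ("fem", "plur"),
}
# Non-accentuated counterparts are computed, not stored by hand.
PLAIN_TO_ACCENTED = {form.translate(STRESS): form for form in PAIRS}

# Context is (previous word, next word, second next word); None at text edges.
Context = Tuple[Optional[str], Optional[str], Optional[str]]
Checker = Callable[[str, str, Context], bool]


def scan(words: List[str], context_fits: Checker) -> List[Tuple[int, str, str]]:
    """Return (position, word found, suggested counterpart) for possible errors."""

    def at(i: int) -> Optional[str]:
        return words[i].lower() if 0 <= i < len(words) else None

    reports = []
    for i, raw in enumerate(words):
        w0 = raw.lower()
        ctx = (at(i - 1), at(i + 1), at(i + 2))
        if w0 in PAIRS:
            # Accentuated form found: it should sit in a noun/adjective context.
            gender, number = PAIRS[w0]
            if not context_fits(gender, number, ctx):
                reports.append((i, raw, w0.translate(STRESS)))
        elif w0 in PLAIN_TO_ACCENTED:
            # Plain form found: test the hypothesis that the noun was intended.
            hypothesis = PLAIN_TO_ACCENTED[w0]
            gender, number = PAIRS[hypothesis]
            if context_fits(gender, number, ctx):
                reports.append((i, raw, hypothesis))
    return reports
```

The function only collects reports; asking the user whether to replace each reported word is left to the interactive part of a real program.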

3. Check of the context

For the algorithm, a procedure is necessary to check, for the given word ŵ, which is supposed to be a noun or adjective of already known gender g and number n, whether a specific 4-word immediate linear context, in which the word order in the text is w₋₁, ŵ, w₁, w₂, is suitable for a noun or adjective with this gender and number. Here ŵ is either the current word w₀ considered by the main algorithm or its accentuated counterpart, as described above.

Let Preps be the list of all simple (one-word) prepositions, Preps = {a, de, con, por, sin, en, sobre, para,…}, and Dets_g,n be the list of quasi-determinatives that depends on the gender g and number n of w₀ according to the following table:

	Singular	Plural
Masculine	un, el, este, ese, aquel, mi, tu, su, al, del, buen, mal, primer, gran	unos, los, estos, esos, aquellos, mis, tus, sus, buenos, malos, primeros, grandes
Feminine	una, la, esta, esa, aquella, mi, tu, su, buena, mala, primera, gran	unas, las, estas, esas, aquellas, mis, tus, sus, buenas, malas, primeras, grandes

Only the case of singular masculine has some peculiarities connected with the use of short forms of adjectives and contracted articles al, del. We should admit, though, that the presence of the word la ‘the’ in the table is questionable because it is homonymous with the accusative pronoun la ‘her’. We decided to include it in the table because of statistical considerations.

Let us use the notation u ~ v for grammatical agreement of words u and v in gender and number, i.e., for the fact that the first word form has or could have (if it is ambiguous) the same gender and number as the second one. The procedure implementing this check is described in the next section. Note that in all cases, the characteristics of the second word v are already known.

The hypothetical word ŵ is considered to be a noun or adjective properly used in the given context, and thus the word w₀ ≠ ŵ is considered likely to be an error, if any of the following four conditions is satisfied:

1. w₋₁ ∈ Dets_g,n ∪ Preps, or

2. w₋₁ ~ ŵ, or

3. w₁ ~ ŵ, or

4. w₁ ∈ {más, mas, menos} and w₂ ~ ŵ.

The tests should be carried out in the given order, which helps to cope with combinations like el número y el género gramaticales, where the agreement is more difficult to check. In condition 4, the word mas (without accent) is tried only if the accent marks are totally lost in the text; what we really mean here is the word más ‘more’ that has lost its accent rather than the very rare Spanish word mas ‘but’.

Since the gender and number of the hypothetical word ŵ are known, to check the agreement in conditions 2 to 4 it is enough to check whether the corresponding word w₋₁, w₁, or w₂ is compatible with the hypothesized number and gender.
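The four conditions can be written down almost literally. The sketch below is a hedged Python illustration: PREPS holds only the prepositions named in the text (the original list is longer), DETS reproduces the table of quasi-determinatives, and the agreement test is a parameter (agrees) because its implementation belongs to the next section. The function toy_agrees is a deliberately crude stand-in of our own, used only to make the example runnable.

```python
from typing import Callable, Optional

# Partial list: only the prepositions named in the text.
PREPS = {"a", "de", "con", "por", "sin", "en", "sobre", "para"}

# Quasi-determinatives by (gender, number), copied from the table.
DETS = {
    ("masc", "sing"): set("un el este ese aquel mi tu su al del buen mal primer gran".split()),
    ("masc", "plur"): set("unos los estos esos aquellos mis tus sus buenos malos primeros grandes".split()),
    ("fem", "sing"): set("una la esta esa aquella mi tu su buena mala primera gran".split()),
    ("fem", "plur"): set("unas las estas esas aquellas mis tus sus buenas malas primeras grandes".split()),
}

Agrees = Callable[[str, str, str], bool]  # (word, gender, number) -> bool


def _fits_one(gender: str, number: str, prev: Optional[str],
              nxt: Optional[str], nxt2: Optional[str], agrees: Agrees) -> bool:
    """Try conditions 1 to 4 in the order given in the text."""
    if prev is not None and (prev in DETS[(gender, number)] or prev in PREPS):
        return True  # condition 1
    if prev is not None and agrees(prev, gender, number):
        return True  # condition 2
    if nxt is not None and agrees(nxt, gender, number):
        return True  # condition 3
    if nxt in {"más", "mas", "menos"} and nxt2 is not None and agrees(nxt2, gender, number):
        return True  # condition 4
    return False


def context_fits(gender: str, number: str, prev: Optional[str],
                 nxt: Optional[str], nxt2: Optional[str], agrees: Agrees) -> bool:
    """For gender "both", two hypotheses are tried, as the Appendix prescribes."""
    genders = ("masc", "fem") if gender == "both" else (gender,)
    return any(_fits_one(g, number, prev, nxt, nxt2, agrees) for g in genders)


def toy_agrees(word: str, gender: str, number: str) -> bool:
    """Crude stand-in for the agreement check: compares one typical ending."""
    ending = {("masc", "sing"): "o", ("fem", "sing"): "a",
              ("masc", "plur"): "os", ("fem", "plur"): "as"}[(gender, number)]
    return word.endswith(ending)
```

For instance, in *el numero del adjetivo the preceding el satisfies condition 1 for a masculine singular noun, so the hypothesis número is accepted.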

4. Morphological agreement check

For the algorithm described above, a procedure is necessary to check whether the two given words agree in gender and number, or, more precisely, since the characteristics of one of the words are already known, it is enough to determine whether the given word has the given characteristics; let us denote this fact as w ~ (gender g, number n).

The strict implementation of this procedure implies availability of a morphological analyzer of Spanish nouns and adjectives based on a morphological dictionary. However, it turned out that in most cases nearly the same results might be achieved through a rougher approximation taking into account only short lists of final substrings of the word. Namely, the following Spanish endings usually indicate the following characteristics:

	Singular	Plural
Masculine	‑do, ‑to, ‑ismo, ‑ero, ‑rio, ‑je, ‑te, ‑al, ‑il	‑os, ‑es
Feminine	‑a, ‑ión, ‑ad, ‑te, ‑al, ‑il	‑as, ‑es

The lists of endings intersect for masculine and feminine, both in singular (‑te, ‑al, ‑il) and in plural (‑es); however, just as with the possessives mi, tu, su, which serve both genders, this does not create any problems. It is more dangerous that there exist nouns of masculine gender ending in -a, such as el problema, el/la deportista, and a few nouns of feminine gender ending in -o, such as la mano, la foto, la radio. However, such situations are rather rare and can be handled by the corresponding lists of exceptions.
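The ending-based approximation with a list of exceptions might look as follows in Python. The ENDINGS table is copied from the text; the EXCEPTIONS table contains only the five words mentioned above and is therefore far from complete, and the function name agrees is our assumption.

```python
# Endings from the table, by (gender, number).
ENDINGS = {
    ("masc", "sing"): ("do", "to", "ismo", "ero", "rio", "je", "te", "al", "il"),
    ("fem", "sing"): ("a", "ión", "ad", "te", "al", "il"),
    ("masc", "plur"): ("os", "es"),
    ("fem", "plur"): ("as", "es"),
}

# Illustrative exception list: only the words named in the text.
EXCEPTIONS = {
    "problema": {("masc", "sing")},
    "deportista": {("masc", "sing"), ("fem", "sing")},
    "mano": {("fem", "sing")},
    "foto": {("fem", "sing")},
    "radio": {("fem", "sing")},
}


def agrees(word: str, gender: str, number: str) -> bool:
    """Rough test of w ~ (gender, number) based on final substrings."""
    w = word.lower()
    if w in EXCEPTIONS:
        return (gender, number) in EXCEPTIONS[w]
    # str.endswith accepts a tuple of alternative suffixes.
    return w.endswith(ENDINGS[(gender, number)])
```

Note that the table has no bare ‑o ending, so a word like adjetivo is not recognized as masculine singular by this sketch; this is the price of the rough approximation.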

5. Some other types of words

There are other error-prone words related to the accent, which are not processed by the general mechanism described above. Namely, these are the words of the parts of speech different from verbs, nouns, or adjectives. For each group of such words, a special case of the algorithm is necessary. Here we describe two heuristics that work for one pair of words each.

· Tú. If w₀ = tu, then w₁ must have the characteristic “singular”, which is checked by the condition w₁ ~ (masculine, singular) or w₁ ~ (feminine, singular). Otherwise, a possible error should be reported.

· Mí. If w₀ = mi, then w₁ ∉ Preps and w₁ must have the characteristic “singular”; otherwise, a possible error should be reported.

Though these groups contain only one pair each, the words in these pairs are much more frequent in texts than other quasi-homonyms, which justifies the introduction of these special cases into the algorithm. There are other similar words, such as sólo, él, más, cómo, sí, but their handling is more difficult.

Another type of error concerns the pronouns that require an accent in interrogative sentences, such as ¿Qué…, ¿Cómo… ¿Cuándo…, etc. There is a very small closed list of such words, and errors of this type can be detected by rather simple heuristics. For example, after a ¿ sign the words like qué always require an accent. We do not discuss these heuristics here.
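The two special cases for tu and mi could be sketched as below. This is a hedged Python illustration: looks_singular reuses the singular endings of the previous section as a rough test of the characteristic “singular”, PREPS contains only the prepositions named earlier, and the function names are hypothetical.

```python
# Partial list of simple prepositions, as named in the text.
PREPS = {"a", "de", "con", "por", "sin", "en", "sobre", "para"}

# Singular endings of both genders, taken from the table of endings.
SINGULAR_ENDINGS = ("do", "to", "ismo", "ero", "rio", "je", "te", "al", "il",
                    "a", "ión", "ad")


def looks_singular(word: str) -> bool:
    """Rough test: w ~ (masculine, singular) or w ~ (feminine, singular)."""
    return word.lower().endswith(SINGULAR_ENDINGS)


def tu_is_suspicious(next_word: str) -> bool:
    """w0 = tu: report a possible error unless the next word looks singular."""
    return not looks_singular(next_word)


def mi_is_suspicious(next_word: str) -> bool:
    """w0 = mi: report a possible error if the next word is a preposition
    or does not look singular."""
    return next_word.lower() in PREPS or not looks_singular(next_word)
```

For example, tu casa passes, whereas tu tienes is reported, because tienes does not look like a singular noun or adjective and tú was probably intended.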

6. Experimental results

The algorithm was implemented as a complete program consisting of 27 subprograms in Pascal. It includes a scanner of the input text and a dialog module for interactive correction of the reported errors in the text.

We conducted two experiments, one with a small text and full manual analysis of the results, and another with a large text corpus, with semi-automatic statistical analysis of the results.

In the first experiment, a real unprepared text of scientific genre written by a foreigner and consisting of 9 pages was analyzed[1]. After application of the Microsoft Word spell-checker, as many as 35 such errors remained undetected in it. The program detected 32 of them, i.e., 91.4% of the errors of the type under consideration not detected by the commercial spell-checker. Only one of the missed errors corresponded to quasi-homonyms from the list: *no es practica, the correct form being no es práctica; two other ones were connected with the word sólo, e.g.: *el modelo solo dificulta, the correct form being el modelo sólo dificulta.

In the second experiment, we used a large Spanish corpus. Since there are no special Spanish corpora against which to compare the results produced by our method, we took as our corpus a collection of unprepared technical texts from Gaceta UNAM (Mexico) and from the Internet, and then employed the following procedure:

1. The entire corpus was passed through the Microsoft Word spell-checker to eliminate orthographic and typographic errors. Among orthographic errors, mainly unknown words were found. When variants of correction were suggested, the first one was automatically accepted. Our experiment was then conducted with this corpus.

2. All stress marks throughout the corpus were removed.

3. Once again, the full corpus was passed through the Microsoft Word 97 spell-checker. We disabled the Grammar check option since in many cases it reported false alarms, and the first experiment had shown that this spell-checker rarely detects errors of the type under consideration even with grammar checking (see below). For each reported orthographic error, the first suggested variant of correction was automatically accepted.

4. Finally, the entire corpus was processed by our program and all its suggestions were automatically accepted.

The resulting corpus after pass 4 was not quite identical to the one obtained after pass 1; thus, some errors of the type under consideration remained undetected. Namely, with passes 2 to 4 we achieved about 92% error detection, whereas the Microsoft Word spell-checker alone gave approximately 82%.
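The comparison of the corpus after pass 4 with the corpus after pass 1 can be illustrated by the Python sketch below. The text does not state how the percentages were computed, so the measure used here, the share of tokens that lost a stress mark in pass 2 and regained their original form, is our assumption, as are the function names.

```python
from typing import List

STRESS = str.maketrans("áéíóúÁÉÍÓÚ", "aeiouAEIOU")


def remove_stress(tokens: List[str]) -> List[str]:
    """Pass 2: remove all stress marks from a tokenized corpus."""
    return [t.translate(STRESS) for t in tokens]


def detection_rate(reference: List[str], restored: List[str]) -> float:
    """Share of tokens damaged in pass 2 that match the reference again."""
    if len(reference) != len(restored):
        raise ValueError("corpora must be aligned token by token")
    damaged = [(ref, res) for ref, res in zip(reference, restored)
               if ref != ref.translate(STRESS)]
    if not damaged:
        return 1.0
    return sum(ref == res for ref, res in damaged) / len(damaged)


# The figure of the first experiment: 32 of 35 remaining errors detected.
first_experiment = round(100 * 32 / 35, 1)
```

Under this measure, a corpus in which one of two damaged tokens is restored scores 0.5; the value of first_experiment reproduces the 91.4% reported for the small text.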

Most of the remaining errors, detected neither by the Word spell-checker nor by our program, are related to the más/mas and sólo/solo cases, and some to the qué/que type discussed in the previous section. Here is an example of a missed error: *consideran totalmente validas, the correct form being válidas. In some cases our program reported false alarms: for example, it marked the word participe in the phrase cada uno participe en la educación as a possible error while in this context it was a true verb form.

There was only one difference found between the performance of Microsoft Word version 6 and version 97, which includes syntax-based grammar checking. When the article el appears immediately before an error of the type under consideration, version 97 reports a possible error in this article and proposes to change it to the pronoun él. If this proposal is accepted, then with phrases like el articulo a new, induced error is reported: after the change to él articulo ‘he I join’, the spell-checker suggests changing it to él articula ‘he joins’, which has a meaning totally different from the intended el artículo ‘the article’.

7. Conclusions

Our algorithm of accent restoration has the following advantages:

· It is hypothesis-driven, i.e., based on an active search for errors rather than on a passive check of the grammaticality of the given text.

· It uses small closed word lists and a very small list of endings.

· Implementation of the algorithm is very simple; it can be implemented independently and used after a regular spell-checker.

An idea similar to the one described here can be used to detect other types of errors related to agreement in number and gender in Spanish between adjacent nouns, adjectives, adjectival pronouns, ordinal numerals, articles, etc.

Appendix. The list of quasi-homonyms

The entries of the quasi-homonyms dictionary have the following format:

Word	Gender	Number
específico	masculine	singular
específica	feminine	singular
específicas	feminine	plural
célebre	both	singular
célebres	both	plural

No difference is made between adjectives and nouns; when one of the forms is a noun while the others are adjectives, this is not marked since it does not affect the gender and number attribution. The masculine nouns are given only in one form, since only this form can be confused with a verb.

In the following list, the characteristics are not given since they can be easily restored by the reader. When the value of some characteristics is “both”, two hypotheses should be tried by the algorithm. The word forms with common stem, and usually of the same lexeme, are grouped together: específico, -a, -as; célebre, -es. Here is the complete list:

acídulo, -a, -as

adúltero, -a, -as

ágora, -as

álabe, -es

alígero, -a, -as

ánimo, -a, -as

apócope, -es

ápodo, -a, -as

apóstata, -as

apóstrofe, -es

apóstrofo

árbitro, -a, -as

artículo

auténtico, -a, -as

báscula, -as

beatífico, -a, -as

cálamos

cálculo

cántara, -as

capítulo

cápsula, -as

catálogo

célebre, -es

centrífugo, -a, -as

círculo

cláusula, -as

coágulo

cómputo

cópula, -as

crítico, -a, -as

cronómetro

décimos

decrépito, -a, -as

décuplo, -a, -as

depósito

desánimo

diagnóstico, -a, -as

diálogo

doméstico, -a, -as

dómine, -es

ejército

émbolo

émulo, -a, -as

epílogo

equívoco, -a, -as

específico, -a, -as

espontáneo, -a, -as

estímulo

estípula, -as

estómago

estrépito

fábrica, -as

filósofo, -a, -as

fórmula, -as

gárrulo, -a, -as

género

gráfico, -a, -as

gránulo

hábito

hidrógeno

homólogo, -a, -as

idólatra, -as

ilegítimo, -a, -as

ímprobo, -a, -as

incómodo, -a, -as

íncubo

íntegro, -a, -as

interlínea, -as

intérprete, -es

íntimo, -a, -as

inválido, -a, -as

júbilo

lágrima, -as

lámina, -as

lápida, -as

lástima, -as

légamos

legítimo, -a, -as

letífico

líbero

lícito, -a, -as

límite, -es

línea, -as

líquido, -a, -as

lúbrico, -a, -as

mácula, -as

magnífico, -a, -as

máquina, -as

matrícula, -as

módulo

monólogo

náufrago, -a, -as

nómina, -as

núcleo

número

ópera, -as

óptimo, -a, -as

órbita, -as

óvalo

óvulo

óxido

oxígeno

pacífico, -a, -as

página, -as

pálpito

páramos

partícipe, -es

pátina, -as

pérdida, -as

petróleo

plática, -as

práctico, -a, -as

prédica, -as

préstamos

pródigo, -a, -as

prólogo

pronóstico

prórroga, -as

próspero, -a, -as

público, -a, -as

púrpura, -as

purpúreo, -a, -as

recíproco, -a, -as

réplica, -as

réprobo, -a, -as

reválida, -as

rótula, -as

rúbrica, -as

simultáneo, -a, -as

síncopa, -as

síncope, -es

síndico

solícito, -a, -as

sólido, -a, -as

subtítulo

súplica, -as

tálamos

témpano

témpera, -as

término

térreo, -a, -as

título

tráfago

tráfico

trámite, -es

tránsito

trépano

triángulo

úlcera, -as

último, -a, -as

válido, -a, -as

vínculo

vómito
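The grouped notation used in the list above can be expanded mechanically into dictionary entries of the format shown at the beginning of the Appendix. The following Python sketch is only an illustration: the function names are ours, and the rule for restoring gender and number from the final letters is an assumption generalized from the five sample entries (específico, específica, específicas, célebre, célebres); it would need manual correction for words of common gender such as apóstata or idólatra.

```python
from typing import List, Tuple


def expand(group: str) -> List[str]:
    """'específico, -a, -as' -> ['específico', 'específica', 'específicas']."""
    parts = [p.strip() for p in group.split(",")]
    base = parts[0]
    stem = base[:-1]  # the suffixes replace the final vowel of the first form
    return [base] + [stem + suffix.strip("-") for suffix in parts[1:]]


def characteristics(form: str) -> Tuple[str, str]:
    """Guess (gender, number) from the ending; an assumed, approximate rule."""
    rules = (
        ("os", ("masculine", "plural")),
        ("as", ("feminine", "plural")),
        ("es", ("both", "plural")),
        ("o", ("masculine", "singular")),
        ("a", ("feminine", "singular")),
        ("e", ("both", "singular")),
    )
    for ending, result in rules:  # two-letter endings are tried first
        if form.endswith(ending):
            return result
    raise ValueError(f"unexpected ending in {form!r}")


def entries(group: str) -> List[Tuple[str, str, str]]:
    """Produce (word, gender, number) rows for one grouped line of the list."""
    return [(form, *characteristics(form)) for form in expand(group)]
```

For the line célebre, -es this yields the rows (célebre, both, singular) and (célebres, both, plural), matching the sample table.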

[1] For the experiment, the initial draft of the following text was used: Bolshakov, I. A. El modelo morfológico formal para sustantivos y adjetivos en el español. Computación y Sistemas, No. 1, 1996, p. 27-35.