Open Fact Extraction Datasets for Spanish and English

Parallel English-Spanish corpus for fact extraction. The Excel file contains two sheets: Spanish and English. The texts were collected from Mexican school textbooks (in Spanish) and manually translated into English, so the sentences are coherent and grammatically correct. In each language, all manually extracted fact triples are given for each sentence. See details in the paper(s) mentioned below.

Raw Web Spanish corpus for fact extraction. The Excel file contains several sheets. The texts were randomly collected from the Internet, with no attempt to filter out incoherent or grammatically incorrect constructions. For each sentence, all manually extracted fact triples are given. See details in the paper(s) mentioned below.
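
As an illustration of how the per-sentence fact triples in the two corpora above might be held in memory, here is a minimal Python sketch. It assumes the usual Open Information Extraction form of a triple (argument 1, relation, argument 2) and a four-field row layout of sentence plus triple; both the row layout and the names FactTriple and group_triples are hypothetical and are not taken from the Excel files, so check the actual sheet columns before reusing it. The example rows are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FactTriple:
    """One extracted fact: (argument 1, relation, argument 2)."""
    arg1: str
    relation: str
    arg2: str


def group_triples(rows):
    """Group (sentence, arg1, relation, arg2) rows by sentence.

    The four-field row layout is an assumption made for illustration;
    it may not match the column layout of the actual sheets.
    """
    by_sentence = defaultdict(list)
    for sentence, arg1, relation, arg2 in rows:
        by_sentence[sentence].append(FactTriple(arg1, relation, arg2))
    return dict(by_sentence)


# Invented example rows, not taken from the datasets.
rows = [
    ("The heart pumps blood and has four chambers.", "The heart", "pumps", "blood"),
    ("The heart pumps blood and has four chambers.", "The heart", "has", "four chambers"),
]
facts = group_triples(rows)
```

Keeping all triples of a sentence in one list mirrors the description above, where a single sentence may carry several manually extracted facts.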

News 300 dataset markup. A ZIP file with various text files.
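
A ZIP archive of text files such as this one can be read without unpacking it to disk. The sketch below uses only the Python standard library. The .txt extension, the UTF-8 encoding, and the function name read_text_files are assumptions, since the archive's contents are not described here; the demonstration runs on an in-memory archive with invented content.

```python
import io
import zipfile


def read_text_files(zip_source):
    """Return {member name: decoded text} for every .txt member of a ZIP archive."""
    texts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            # Assumption: the text files carry a .txt extension and are UTF-8.
            if name.lower().endswith(".txt"):
                texts[name] = archive.read(name).decode("utf-8", errors="replace")
    return texts


# Self-contained demonstration on an in-memory archive with invented content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.txt", "Texto de ejemplo.")
    archive.writestr("notes.md", "ignored")
demo = read_text_files(buffer)
```

To read a downloaded archive instead, pass its file path to read_text_files in place of the in-memory buffer.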

These are the datasets mentioned in the publications:

Alisa Zhila, Alexander Gelbukh. Comparison of Open Information Extraction for English and Spanish. Submitted.
Alisa Zhila. Open Information Extraction using Constraints over Part-of-Speech Sequences. PhD thesis. Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico, 2014.

LICENSE: The data are free to use for academic and research purposes, provided that (1) all publications obtained with the use of these data cite the publication(s) mentioned on this page (please contact the authors for a complete reference), and (2) all products obtained with the use of these data give proper attribution and a reference to the corresponding webpage, unless the nature of the product does not allow it. Datasets derived from these data should be released under similar terms requiring the same references. For other uses, please contact the authors.

CONTACT: Alisa Zhila, Alexander Gelbukh.