Implicit Aspect Indicator Extraction Tool

Webpage: http://www.gelbukh.com/resources/implicit-aspect-extraction-corpus/

This is the tool used for Implicit Aspect Indicator Extraction as described in the paper: 

    Ivan Omar Cruz-Garcia, Alexander Gelbukh, Grigori Sidorov. "Implicit Aspect Indicator Extraction for Aspect-based Opinion Mining", 2014, submitted.

LICENSE: The software can be used free of charge for non-commercial academic purposes, and can be modified provided that the derived products give proper credit to the original authors and cite this licence, including the full reference to the above-mentioned paper. Any publication that benefits from this software or data should cite the above-mentioned paper and the papers listed on http://www.gelbukh.com/resources/implicit-aspect-extraction-corpus/. If you fix bugs, modify this software, or build on it, we will be grateful if you send us the modified version so that we can make it available, or link to it, from the above-mentioned site.

These tools use Java (we used JDK 7) and Python 2.7. The baselines use NLTK (www.nltk.org).

FILE DESCRIPTION

Two directories are included:

    -   Implicit Aspect Indicator Extractor: This folder contains the Implicit Aspect Indicator Extractor files.
    
    -   Baseline: This folder contains the baselines used for performance comparison in the article.
    
Implicit Aspect Indicator Extractor Folder Files:
    
    -   stanford-ner-2013-06-20.jar: This is the Stanford NER v3.2.0 JAR file. The IAI extractor uses the CRFClassifier included in this file.
        For more information about the Stanford NER and how to use the CRFClassifier, go to http://nlp.stanford.edu/software/CRF-NER.shtml
        
    -   CRFImplicitAspectExtractor.prop: This is the CRFClassifier configuration file. For more information about 
        this file, go to http://nlp.stanford.edu/software/CRF-NER.shtml

    -   TrainDataX.tsv - TestDataX.tsv: These are the training and testing data files. The texts of the corpus described in the paper were divided into 10 pieces for a 10-fold 
        cross-validation experimental setup. For a single fold, 80% of the corpus is used for training and the remaining 20% is used for testing. The following diagram
        shows how the data was divided for each fold.
        
        |----------|  <-  Whole corpus texts
        -  <-  Training Data
        +  <-  Testing Data
        
        *   TrainData0.tsv - TestData0.tsv: |++--------|
        *   TrainData1.tsv - TestData1.tsv: |-++-------|
        *   TrainData2.tsv - TestData2.tsv: |--++------|
        *   TrainData3.tsv - TestData3.tsv: |---++-----|
                            .
                            .
                            .
        *   TrainData8.tsv - TestData8.tsv: |-------++-|
        *   TrainData9.tsv - TestData9.tsv: |--------++|

    
    -   CRFImplicitAspectExtractor.template: This is the template for the CRFClassifier prop file. The prop file tells the CRFClassifier which training data file to use, and that file changes
        with each fold. The tool therefore uses this template to generate one prop file per fold, changing only the training data file value.
        
    -   CRFImplicitAspectExtractor.template-XXX-XXX: These are the CRFClassifier prop template files, one for each of our experiments. The abbreviations in the file names stand for:
        
        *   CNG: Character N-Gram Features.
        *   CNTX: Context Features.
        *   WT: Word-Tag Features.
        *   CLS: Class Features.
        
        For more information, see the paper.
        
    -   CrossValTrainTest.py: This Python script generates a .prop file and a TrainAndTestX.bat file for each fold. TrainAndTestX.bat runs the experiment for fold X.
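
The per-fold prop file generation described for CRFImplicitAspectExtractor.template and CrossValTrainTest.py can be sketched roughly as follows. This is not the code of CrossValTrainTest.py, only a minimal illustration under assumptions: that the template contains a "trainFile = ..." line (trainFile and serializeTo are standard CRFClassifier properties), and that the per-fold prop files are named CRFImplicitAspectExtractorN.prop, which is a hypothetical naming scheme. The generation of the TrainAndTestX.bat files is not sketched, since this README does not show their contents.

```python
import re

# Assumption: the template holds a "trainFile = ..." line. The per-fold
# prop file names used below are hypothetical.
TRAIN_FILE_LINE = re.compile(r"^([ \t]*trainFile[ \t]*=[ \t]*).*$", re.MULTILINE)


def render_prop(template_text, train_file):
    """Return the template text with the trainFile value replaced."""
    new_text, count = TRAIN_FILE_LINE.subn(
        lambda match: match.group(1) + train_file, template_text)
    if count == 0:
        raise ValueError("template has no trainFile property")
    return new_text


def props_for_folds(template_text, n_folds=10):
    """Map a (hypothetical) prop file name to its contents, one per fold."""
    return dict(
        ("CRFImplicitAspectExtractor%d.prop" % fold,
         render_prop(template_text, "TrainData%d.tsv" % fold))
        for fold in range(n_folds))


# A two-line stand-in for a real template, for illustration only.
example_template = "trainFile = TrainData0.tsv\nserializeTo = model.ser.gz\n"
props = props_for_folds(example_template)
print(props["CRFImplicitAspectExtractor3.prop"])
```

Only the trainFile line differs between the generated texts; every other property is copied from the template unchanged, which matches the description above.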
        
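The train/test layout described for TrainDataX.tsv and TestDataX.tsv above can be illustrated with a short sketch. It assumes the corpus is already cut into 10 equal pieces and that a fold's test data is a window of 2 consecutive pieces. The start position of the window is passed in explicitly and should be read off the diagram; the function names and the piece labels are illustrative, not part of the tool.

```python
def split_pieces(pieces, start, width=2):
    """Return (train, test); `width` consecutive pieces from `start` are test."""
    test = pieces[start:start + width]
    train = pieces[:start] + pieces[start + width:]
    return train, test


def render_split(n_pieces, start, width=2):
    """Draw a split in the notation used above: '-' training, '+' testing."""
    marks = ["+" if start <= i < start + width else "-" for i in range(n_pieces)]
    return "|" + "".join(marks) + "|"


# Illustrative labels standing in for the 10 corpus pieces.
pieces = ["piece%d" % i for i in range(10)]
train, test = split_pieces(pieces, start=3)
print(render_split(10, 3))    # |---++-----|
print(len(train), len(test))  # 8 training pieces, 2 testing pieces
```
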
Baseline: This folder is a PyDev project folder. It can be imported as a Python project in Eclipse with the PyDev plugin.

    -   Data Folder: It has the IAI annotated corpus, the original corpus, and some folders for intermediate files.
    -   Source Folder: The source files
        *   SentWordsBaseline.py: The baseline BSLN1 described in the article.
        *   Baseline-Sent.py: The baseline BSLN2 described in the article.
        *   nltk_classifiers.py: The Naive Bayes classifier
        *   CorpusStatistics: the tools used to get the corpus statistics described in the paper
        *   classifier.pickle: the classifier serialization file.
        
HOW TO USE THE IAI EXTRACTOR

    -   Select a .template file to test (for example, CRFImplicitAspectExtractor.template-WT).
    -   Copy this file into the same folder and rename the copy to CRFImplicitAspectExtractor.template.
    -   Execute CrossValTrainTest.py.
    -   Execute the TrainAndTest file of a specific fold (for example, TrainAndTest3.bat).
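
The four steps above could be scripted along the following lines. This is a sketch, not part of the tool: the helper names template_paths and run_fold are hypothetical, and it assumes that CrossValTrainTest.py takes no arguments and is run from its own folder, that the interpreter running the sketch is the Python 2.7 the tool expects, and that the .bat file is run on Windows through cmd.

```python
import os
import shutil
import subprocess
import sys

TEMPLATE = "CRFImplicitAspectExtractor.template"


def template_paths(folder, suffix):
    """Return (source, destination) paths for the chosen experiment template."""
    return (os.path.join(folder, TEMPLATE + "-" + suffix),
            os.path.join(folder, TEMPLATE))


def run_fold(folder, suffix, fold):
    """Run the four steps for one template suffix and one fold (Windows only)."""
    source, destination = template_paths(folder, suffix)
    # Steps 1-2: copy the selected template under the generic name.
    shutil.copyfile(source, destination)
    # Step 3. Assumption: the script takes no arguments and runs in its folder.
    subprocess.check_call([sys.executable, "CrossValTrainTest.py"], cwd=folder)
    # Step 4. .bat files need the Windows command interpreter.
    subprocess.check_call(["cmd", "/c", "TrainAndTest%d.bat" % fold], cwd=folder)


# Example call (not executed here):
# run_fold("Implicit Aspect Indicator Extractor", "WT", 3)
```

Using check_call makes the sketch stop at the first failing step instead of running a fold against stale prop files.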