Software and Data

Here you can find some of the modules and software that I have developed, mostly in Python, using GitHub or Bitbucket as platforms to store the source code and scripts. The complete list follows, and below it you can find a further explanation and link for each module.

  1. Wrapper for the It Makes Sense WSD system
  2. Converter from SemCor to NAF format
  3. WSD toy system for kids
  4. Subjectivity detector
  5. KafNafParserPy
  6. Morphosyntactic parser for Dutch
  7. DutchSemCor WSD system for Dutch (SVM-based)
  8. Lexical pattern extractor for Dutch
  9. Treetagger wrapper for KAF/NAF files
  10. Converter from MPQA to KAF or NAF
  11. Tokeniser and sentence splitter (plain text ==> KAF/NAF)
  12. Dependency parser for Dutch based on Alpino (KAF/NAF)
  13. Fine-grained opinion miner trained on hotel reviews and news
  14. DBpedia and DBpedia-ontology enquirer
  15. WSD evaluator
  16. DBpedia Spotlight NER + NED
  17. Terminology and n-gram Extractor
  18. WSD corpora
  19. SensEval/Semeval participant outputs
  20. Semantic Class Manager
  21. Basic Level Concepts for various WordNet versions

1) Wrapper for the It Makes Sense WSD system

This module implements a Python wrapper around the It Makes Sense (IMS) system for Word Sense Disambiguation of English text, allowing KAF or NAF files as input and output. For further information, see the URL of the module.

=> URL (github): http://github.com/rubenIzquierdo/it_makes_sense_WSD

=> Keywords: WSD, English text, KAF NAF, It Makes Sense system, IMS

2) Converter from SemCor to NAF format

Scripts to convert files in SemCor format (or the original SemCor corpus) into NAF format.
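
For reference, the SemCor source annotations can also be browsed through NLTK's corpus reader. This is illustrative only (it is not the converter itself) and assumes NLTK with its semcor and wordnet data downloaded:

    # Illustrative only: browsing SemCor sense annotations via NLTK
    # (requires nltk.download('semcor') and nltk.download('wordnet')).
    from nltk.corpus import semcor

    # Sense-annotated chunks are trees whose label is a WordNet lemma;
    # unannotated chunks are plain token lists, hence the hasattr guard.
    for chunk in semcor.tagged_chunks(tag='sem')[:20]:
        if hasattr(chunk, 'label') and chunk.label() is not None:
            print(chunk.label(), chunk.leaves())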

=> URL (github): http://github.com/rubenIzquierdo/semcor_to_naf

=> API docs: http://kyoto.let.vu.nl/~izquierdo/api/semcor_to_naf/

=> Keywords: SemCor converter, NAF

3) WSD toy system for kids

Toy WSD system with basic functions for training and testing a simple SVM-based classifier.
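
A minimal sketch of the kind of training/testing loop such a system performs, using scikit-learn (illustrative; not this module's actual API). Each instance is a bag of context words around the ambiguous word, labelled with its sense:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.svm import LinearSVC

    # Toy training data: context-word features -> sense label.
    train = [({'ctx_money': 1, 'ctx_deposit': 1}, 'bank.n.1'),
             ({'ctx_river': 1, 'ctx_water': 1}, 'bank.n.2')]
    X, y = zip(*train)

    vec = DictVectorizer()
    clf = LinearSVC()
    clf.fit(vec.fit_transform(X), y)

    # Disambiguate a new occurrence from its context words.
    print(clf.predict(vec.transform([{'ctx_river': 1, 'ctx_shore': 1}])))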

=> URL (github): https://github.com/rubenIzquierdo/wsd_for_kids

=> Keywords: WSD, toy system

4) Subjectivity detector

This repository implements an SVM-based subjectivity detector that predicts whether a sentence is subjective (contains opinions or subjective information) or not. It provides trained models and also lets you train your own.
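
The general approach can be sketched as follows (illustrative; the repository ships its own feature extraction and trained models): an SVM over bag-of-words features labels each sentence as subjective or objective.

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Toy data: one subjective and one objective sentence.
    sentences = ['I absolutely loved this film', 'The film runs 120 minutes']
    labels = ['subjective', 'objective']

    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(sentences, labels)
    print(model.predict(['What a wonderful soundtrack']))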

=> URL (github): https://github.com/rubenIzquierdo/subjectivity_detector

=> Keywords: Subjectivity detector, SVM

5) KafNafParserPy

This module implements a Python parser for KAF and NAF files, making it very easy to parse these files, extract information from the different layers and create new layers.
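
A minimal reading example; the accessor names used here (KafNafParser, get_terms, get_lemma, get_pos) follow the repository's documentation, so check the API docs linked below for the full interface:

    from KafNafParserPy import KafNafParser

    naf = KafNafParser('example.naf')   # any KAF/NAF file
    for term in naf.get_terms():
        print(term.get_id(), term.get_lemma(), term.get_pos())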

=> URL (github): https://github.com/cltl/KafNafParserPy

=> API docs: http://kyoto.let.vu.nl/~izquierdo/api/KafNafParserPy

=> Keywords: KAF NAF parser python

6) Morphosyntactic parser for Dutch

Morphosyntactic parser for Dutch based on the Alpino parser. It takes as input a NAF/KAF file with tokens (processed by a tokeniser and sentence splitter) and generates the term layer (lemmas and rich morphological information), the constituency layer and the dependency layer. The Alpino parser is only called once to improve the performance of our module.

=> URL (github): https://github.com/cltl/morphosyntactic_parser_nl

=> Keywords: morphosyntactic parser KAF NAF Dutch Alpino

7) DutchSemCor WSD system for Dutch

Word Sense Disambiguation system for Dutch text, developed in the DutchSemCor project using Support Vector Machines. The input can be plain text (which requires TreeTagger to be installed) or KAF/NAF files; the output can be XML or KAF/NAF.

=> URL (github): https://github.com/cltl/svm_wsd

=> Keywords: WSD SVM Dutch DutchSemCor

8) Lexical pattern extractor for Dutch

This tool generates a set of words that appear in the same contexts as a seed list that you provide as input. The process consists of two steps: 1) generation of patterns from the seed list, and 2) extraction of candidate words from those patterns.

The current tool only works for Dutch, as it makes use of the Google Web 5-gram database for Dutch hosted at http://www.let.rug.nl/gosse/bin/Web1T5_freq.perl to get frequencies for n-grams. It could easily be adapted to a new language or domain by providing a new source of n-gram frequencies for that domain/language.
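
The two steps can be sketched as follows (illustrative; not the tool's real code). Frequencies from the Web1T5 service above would normally be used to rank patterns and candidates; that part is omitted here:

    def generate_patterns(seed, ngrams):
        # Step 1: every 5-gram containing the seed becomes a pattern,
        # with the seed replaced by a wildcard slot.
        return {tuple('*' if w == seed else w for w in g)
                for g in ngrams if seed in g}

    def extract_candidates(patterns, ngrams):
        # Step 2: any word that fills the wildcard slot of a known
        # pattern is a candidate similar to the seeds.
        candidates = set()
        for g in ngrams:
            for p in patterns:
                if all(pw in ('*', gw) for pw, gw in zip(p, g)):
                    candidates.update(gw for pw, gw in zip(p, g)
                                      if pw == '*')
        return candidates

    ngrams = [('de', 'grote', 'hond', 'blaft', 'luid'),
              ('de', 'grote', 'kat', 'blaft', 'luid')]
    patterns = generate_patterns('hond', ngrams)
    print(extract_candidates(patterns, ngrams))   # {'hond', 'kat'}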

=> URL (github): https://github.com/cltl/lexical_pattern_extractor

=> Keywords: Lexical pattern extractor Dutch 

9) TreeTagger wrapper for KAF/NAF files

This module implements a wrapper around TreeTagger that works with KAF or NAF as input/output files. The wrapper supports English, Dutch, German, Spanish, Italian and French, although it is very easy to add new languages.

=> URL (github): https://github.com/rubenIzquierdo/treetagger_kaf_naf

=> Keywords: Treetagger KAF NAF

10) Converter from MPQA to KAF or NAF

This repository implements a converter from the original format of the MPQA corpus (manually annotated with opinions and subjective annotations) to KAF or NAF. It can also automatically retokenize and POS-tag the resulting files.

=> URL (github): https://github.com/rubenIzquierdo/converter_mpqa_to_kafnaf

=> Keywords: MPQA converter KAF NAF opinion corpus

11) Tokeniser and sentence splitter

This repository implements a wrapper around the Apache OpenNLP toolkit for tokenisation and sentence splitting. It takes plain text as input and generates a KAF/NAF file with the text (tokens) layer.

=> URL (github): https://github.com/cltl/tokeniser-opennlp

=> Keywords: tokeniser KAF NAF open-nlp

12) Dependency parser for Dutch (based on Alpino)

This repository implements a wrapper around the Alpino parser for Dutch that takes KAF/NAF files as input and generates KAF/NAF output extended with a dependency layer storing all the extracted dependencies.

=> URL (github): https://github.com/cltl/dependency-parser-nl

=> Keywords: dependency parser Dutch KAF NAF Alpino

13) Opinion miner

This repository implements a fine-grained (also known as feature-based) opinion mining system that extracts opinions from KAF/NAF files, creating opinion triples (expression, target, holder). It was trained using Conditional Random Fields and Support Vector Machines on the data manually annotated within the OpeNER European project: hotel and restaurant reviews as well as news.
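
The sequence-labelling core can be illustrated with sklearn-crfsuite (a sketch of the technique, not the repository's code): tokens get B/I/O labels marking opinion expressions, and the same scheme applies to targets and holders.

    import sklearn_crfsuite

    def features(sent, i):
        # Minimal token features; the real system uses far richer ones.
        return {'word': sent[i].lower(), 'is_first': i == 0}

    sents = [['the', 'room', 'was', 'really', 'dirty']]
    X_train = [[features(s, i) for i in range(len(s))] for s in sents]
    y_train = [['O', 'O', 'O', 'B-expression', 'I-expression']]

    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
    crf.fit(X_train, y_train)
    print(crf.predict(X_train))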

=> URL (github): https://github.com/cltl/opinion_miner_deluxe

=> Keywords: opinion miner KAF NAF opinion expression target holder

14) DBpedia and DBpedia-ontology enquirer

This library provides classes and functions to query DBpedia and the DBpedia ontology using SPARQL. Queries go through the Virtuoso SPARQL endpoint via the Python library SPARQLWrapper, which enables the use of SPARQL and RDF from Python. For the ontology, the OWL file from DBpedia defining it is automatically downloaded and exploited locally with Python functions.
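
As an example of the kind of query the library wraps, here is a direct SPARQLWrapper call against the public DBpedia Virtuoso endpoint:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper('http://dbpedia.org/sparql')
    sparql.setQuery('''
        SELECT ?abstract WHERE {
          <http://dbpedia.org/resource/Amsterdam>
              <http://dbpedia.org/ontology/abstract> ?abstract .
          FILTER (lang(?abstract) = "en")
        }''')
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()['results']['bindings']:
        print(row['abstract']['value'])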

=> URL (github): https://github.com/rubenIzquierdo/dbpediaEnquirerPy

=> Keywords: DBpedia query ontology

15) WSD Evaluator

Performs evaluation of WSD on a list of KAF/NAF files using the official Python scorer from Senseval-2.
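
The scoring idea, heavily simplified (the repository relies on the official scorer): each key file maps an instance id to one or more accepted sense keys, and an answer counts as correct if it matches any of them.

    def score(system, gold):
        # Precision over answered instances, recall over gold instances.
        correct = sum(1 for inst, senses in system.items()
                      if senses & gold.get(inst, set()))
        return correct / len(system), correct / len(gold)

    gold = {'d00.s01.t03': {'bank%1:14:00::'}}
    system = {'d00.s01.t03': {'bank%1:14:00::'}}
    print(score(system, gold))   # (1.0, 1.0)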

=> URL (github): https://github.com/rubenIzquierdo/wsd_evaluation

=> Keywords: WSD evaluation

16) DBpedia Spotlight NER + NED

Performs Named Entity Recognition (NER) and linking (NED) on KAF/NAF files. It only requires the token and term layers, and it creates the entity layer with references to DBpedia entries.
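
Underneath, this boils down to a REST call to DBpedia Spotlight; a sketch with the public endpoint and parameters as documented at the time of writing:

    import requests

    resp = requests.get('https://api.dbpedia-spotlight.org/en/annotate',
                        params={'text': 'Berlin is the capital of Germany',
                                'confidence': 0.5},
                        headers={'Accept': 'application/json'})
    for res in resp.json().get('Resources', []):
        print(res['@surfaceForm'], '->', res['@URI'])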

=> URL (github): https://github.com/rubenIzquierdo/dbpedia_ner

=> Keywords: DBpedia Spotlight NER NED

17) Terminology and n-gram Extractor

This repository contains a set of scripts and tools to extract relevant terms or multiterms from a corpus of documents given a set of patterns (a small sketch of the idea follows the list below). Specifically, it can automatically:

  • Process the documents through a shallow pipeline to obtain morphological tags and lemma information for every token.
  • Index the processed documents into an SQL database, which is then queried by the pattern extractor.
  • Extract relevant terms or multiterms from the database given a set of patterns.
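
A sketch of the indexing and pattern-matching idea with a hypothetical schema (not the repository's actual one): a pattern such as "adjective + noun" becomes a self-join on token positions.

    import sqlite3

    db = sqlite3.connect(':memory:')
    db.execute('CREATE TABLE tokens (doc TEXT, idx INT, lemma TEXT, tag TEXT)')
    db.executemany('INSERT INTO tokens VALUES (?, ?, ?, ?)',
                   [('d1', 0, 'neural', 'ADJ'), ('d1', 1, 'network', 'NOUN')])

    rows = db.execute('''
        SELECT a.lemma, b.lemma FROM tokens a
        JOIN tokens b ON a.doc = b.doc AND b.idx = a.idx + 1
        WHERE a.tag = 'ADJ' AND b.tag = 'NOUN' ''').fetchall()
    print(rows)   # [('neural', 'network')]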

=> URL (github): https://github.com/rubenIzquierdo/terminology_extractor

=> Keywords: Terminology extractor, pattern matching, SQL indexing

18) WSD corpora

Corpora used in WSD, annotated with senses and converted to NAF format. You will find:

    • SemCor (with WordNet senses 1.6 and 3.0)
    • SensEval-2: traditional all-words task
    • SensEval-3: traditional all-words task
    • SemEval-2010 task 17: WSD on a specific domain
    • SemEval-2007 task 17 all words
    • SemEval-2013 Task 12: Multilingual Word Sense Disambiguation (languages: en, es, fr, it, de)
    • Princeton WordNet Gloss Corpus (original files are also included in the folder itself)

=> URL (github): https://github.com/rubenIzquierdo/wsd_corpora

=> Keywords: WSD corpora, NAF format, SemCor, SensEval, SemEval

19) SensEval/Semeval participant outputs

Outputs of all the systems that participated in previous SensEval/SemEval WSD tasks, in XML format (gold keys included):

    • SensEval-2 all-words task
    • SensEval-3 all-words task
    • SemEval-2007 task 17 all-words task
    • SemEval-2010 task 17 specific-domain all-words WSD task
    • SemEval-2013 task 12 all-words task

=> URL (github): https://github.com/rubenIzquierdo/sval_systems

=> Keywords: Senseval, Semeval, participant outputs, NAF

20) Semantic Class Manager

Code and a Python API to access and query different sets of semantic classes (Basic Level Concepts, WordNet Domains, SuperSenses).

=> URL (github): https://github.com/rubenIzquierdo/semantic_class_manager

=> Keywords: Semantic Class, BLC, WordNet Domains, SuperSenses

21) Basic Level Concepts

Basic Level Concepts extracted for several versions of WordNet. The software to extract your own BLCs for a different version of WordNet is also included.
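
A heavily simplified variant of the idea, using NLTK's WordNet (the repository implements the actual BLC selection): climb the hypernym chain and prefer the ancestor with the most relations.

    from nltk.corpus import wordnet as wn

    def blc(synset):
        # Candidate ancestors: the synset plus everything on its
        # hypernym paths; pick the most connected one.
        chain = [synset] + [h for path in synset.hypernym_paths()
                            for h in path]
        return max(chain,
                   key=lambda s: len(s.hyponyms()) + len(s.hypernyms()))

    print(blc(wn.synset('dog.n.01')))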

=> URL (github): https://github.com/rubenIzquierdo/basic_level_concepts

=> Keywords: BLC, Basic Level Concepts