A Virtual Machine for NLP and CL

Introduction

This page describes a proposed virtual machine image for Natural Language Processing and Computational Linguistics. The machine itself will be created once we have identified the major packages to include in the first version. Initially the image should work with VMware Player, available from http://www.vmware.com/, but we should also consider providing a version that runs in the open source VirtualBox, from http://www.virtualbox.org, as some users may find that it works better for them.

The Virtual Machine

Once the virtual machine is configured, a link to the image (or to test versions) can be provided here. Access should be restricted if the image includes software that cannot be redistributed to third parties. We could also offer a second, "open" version that excludes packages and corpora with restrictive licences.
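The split between a restricted and an "open" image could be driven by a simple manifest. The following is only a minimal sketch under stated assumptions: the Item record, the redistributable flag and the placeholder entries are all hypothetical, and the real licence terms of each package and corpus would have to be checked individually.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Item:
    """One package or corpus considered for the image (hypothetical record)."""
    name: str
    redistributable: bool  # False if the licence forbids passing it to third parties


def open_subset(manifest: List[Item]) -> List[str]:
    """Return the names that may go into the unrestricted 'open' image."""
    return [item.name for item in manifest if item.redistributable]


# Placeholder entries only; real licence terms must be checked per package.
MANIFEST = [
    Item("example-open-tool", redistributable=True),
    Item("example-restricted-corpus", redistributable=False),
    Item("example-open-corpus", redistributable=True),
]

if __name__ == "__main__":
    print("Full image:", [item.name for item in MANIFEST])
    print("Open image:", open_subset(MANIFEST))
```

Keeping the licence flag next to each entry means both images can be derived from one list, rather than maintaining two lists by hand.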

Proposed Software

Links are included only when the package is not already part of the standard Ubuntu repositories.

Guest operating system

Ubuntu 10.04 LTS
Long Term Support release of Ubuntu, ideally configured for automatic updates and for simplified mounting of host file systems and networked file space. Configured to use GNOME, but a version with a more lightweight desktop, such as LXDE, could also be made available.

Software available from the Ubuntu repositories

R
Statistics package, with a full complement of extensions for NLP. Any extensions that are missing from the Ubuntu repositories should be documented below.
Emacs-gtk
Comprehensive editor and IDE, with GTK support.
Eclipse
Software development IDE.
[Notepad++]
[Graphical text editor, but not Linux-native; requires Wine. Look into alternatives.]
OpenJDK
Full JDK, built using IcedTea.
SWI-Prolog
Near SICStus clone, with the XPCE graphical interface.
Python
Interpreted programming language, for NLTK.
Perl
Scripting language.
Chromium
Lightweight web browser.
Remmina
Allows access to Microsoft Windows and X desktops via RDP, VNC or NX protocols. (Latest version to be installed by adding a PPA entry.)
And …
add your suggestions here.
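Once a test image exists, a quick check that the listed tools are actually installed might look like the sketch below. The executable names in EXPECTED_TOOLS are assumptions and should be confirmed against the real packages. The script relies on shutil.which, which needs Python 3.3 or later, so it may not run on the default Python shipped with the image.

```python
import shutil
from typing import Iterable, List

# Assumed executable names for the packages listed above; the actual
# names on the image should be confirmed before relying on this list.
EXPECTED_TOOLS = ["R", "emacs", "eclipse", "javac", "swipl", "python", "perl"]


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(EXPECTED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All expected tools found.")
```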

Software not thought to be included in the Ubuntu repositories

We need to double-check whether any of these packages are in fact in the Ubuntu repositories, or in Ubuntu/Debian-compatible third-party repositories (PPAs), which would allow them to be managed using the Synaptic/apt-get package management system.
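One way to carry out this check on a running Ubuntu system is to ask apt for each package's installation candidate. The sketch below wraps the apt-cache policy command and reads its "Candidate:" line. The package names in the example loop are hypothetical and need replacing with the real ones, and the result only reflects the repositories configured on the machine where it runs.

```python
import os
import subprocess
from typing import Optional


def parse_candidate(policy_output: str) -> Optional[str]:
    """Extract the candidate version from `apt-cache policy` output.

    Returns None when there is no 'Candidate:' line or it reads '(none)'.
    """
    for line in policy_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Candidate:"):
            version = stripped[len("Candidate:"):].strip()
            return None if version in ("", "(none)") else version
    return None


def candidate_version(package: str) -> Optional[str]:
    """Ask apt which version of `package` the configured repositories offer."""
    env = dict(os.environ, LC_ALL="C")  # force untranslated field labels
    try:
        result = subprocess.run(
            ["apt-cache", "policy", package],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
            env=env,
            check=False,
        )
    except OSError:
        return None  # apt-cache is not available on this system
    return parse_candidate(result.stdout)


if __name__ == "__main__":
    # Hypothetical package names; substitute the real ones once known.
    for name in ["python-nltk", "example-package"]:
        print(name, "->", candidate_version(name) or "no candidate found")
```

A package that reports no candidate would stay in this section with a download link; one that reports a version could move to the repository list above.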

NLTK
Python-based natural language toolkit. http://www.nltk.org
jLSA and SVDlib
Tools for performing Latent Semantic Analysis. ?link? These may no longer be available. For alternatives, see http://en.wikipedia.org/wiki/Latent_semantic_analysis. Another Java LSA package, which includes a Java port of the SVD libraries, is available from http://code.google.com/p/airhead-research/downloads/list. See http://code.google.com/p/airhead-research/wiki/LatentSemanticAnalysis for more details.
Random Indexing
LSI/PLSI-like dimensionality reduction, as provided by the Semantic Vectors package. See http://code.google.com/p/semanticvectors/
Semantic Engine
Recall search engine C++ library, supports document similarity clustering. http://code.google.com/p/semantic-engine/
GATE
NLP infrastructure. http://gate.ac.uk/
UIMA
NLP infrastructure. http://uima.apache.org/
ROUGE
Evaluation metrics for summarisation. http://berouge.com/default.aspx
CLUTO
Clustering software: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. Visualisation front end (gCLUTO): http://glaros.dtc.umn.edu/gkhome/cluto/gcluto/download
Brill Tagger
The original tagger: http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/parsing/taggers/brill/0.html. Also the GPoSTTL enhanced Brill tagger: http://sourceforge.net/projects/gposttl/
Stanford Software
Including the Stanford Parser etc. http://nlp.stanford.edu/software/lex-parser.shtml
Other software
Packages mentioned at http://nlp.stanford.edu/links/statnlp.html
Corpora
Smaller corpora for NLP and CL, ?links?, and documentation for adding larger corpora, or corpora that have a restrictive licence. See the resources page.
And …
add your suggestions here.
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License