A Collection of Tools and APIs for Language Processing

Sanchay is an open source platform for working with languages, especially South Asian languages, on computers, and for developing Natural Language Processing (NLP) or other text processing applications. It consists of various tools and APIs for this purpose. It is still under development and the design has not yet stabilized, but components such as a text editor with customizable support for languages and encodings, annotation interfaces, etc. were first released as an experimental version (0.1) on Sourceforge.net. The next version (0.2) was made available on the Internet and has also been released on Sourceforge.net, along with the latest version (0.3). Sanchay is meant to complement other existing NLP tools and libraries.

Some of the components in the released version are: a syntactic annotation interface, generalised table and tree components, an SSF (Shakti Standard Format) API, a feature structure API, a parallel corpus markup interface, customizable language and encoding support, the Sanchay text editor, language and encoding identification, a file splitter and format converter, a task setup generator (only for syntactic annotation), a simple but powerful data structure called Properties Manager along with a GUI for purposes like customization of applications, a find/replace/extract tool, a CRF-based automatic annotation tool, and a tree visualizer for phrase structure and dependency relations. Recent additions include the Sanchay Corpus Query Language (SCQL) and the Sanchay Shell. User documentation is provided for some of these components, and more will be added soon. Some API documentation for programmers will also be provided later.

Sanchay has an object-oriented architecture with an emphasis on modularity, reusability, extensibility and maintainability. The implementation is purely in Java, which means it is platform independent and can be used on Windows as well as Linux without any extra setup beyond installing a JDK or JRE.


How do I Start Using Sanchay?


First ensure that the Sun (now Oracle) JDK 1.6 is installed on your system (the JDK that comes with Linux might not work).
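A quick way to see which Java installation a terminal picks up is sketched below. It only reports the first line printed by 'java -version' for whichever 'java' is first on the PATH. The variable name java_status is illustrative and not part of Sanchay.

```shell
# Report which Java (if any) is first on the PATH.
# 'java -version' writes to stderr, so redirect it before reading the first line.
if command -v java >/dev/null 2>&1; then
    java_status=$(java -version 2>&1 | head -n 1)
else
    java_status="java not found on PATH"
fi
echo "$java_status"
```

If the reported version is not the Sun/Oracle JDK 1.6 mentioned above, the PATH may need adjusting before Sanchay is started.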

After that, download the Sanchay zip or tar file from the Downloads or the Latest Builds section. If you want the latest version, it is better to go to the Latest Builds section.

Then, before starting, you may want to create a directory named 'sanchay' and extract the Sanchay zip or tar file into it (this is optional but recommended). This directory ('sanchay') can hold everything related to Sanchay. Extracting the zip/tar file creates another directory (e.g. Sanchay-03-05-11) inside the 'sanchay' directory; this is the actual Sanchay directory from which the program is started.
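On Linux, the extraction step can be sketched as a small shell helper. The function name extract_sanchay is hypothetical and not part of Sanchay. The archive path is passed as an argument because the exact file name depends on the build you downloaded, and the target directory defaults to 'sanchay'.

```shell
# Hypothetical helper (not shipped with Sanchay): extract a downloaded
# Sanchay archive into a target directory, creating the directory if needed.
#   $1 = path to the downloaded zip or tar file
#   $2 = target directory (optional, defaults to 'sanchay')
extract_sanchay() {
    archive=$1
    target=${2:-sanchay}
    mkdir -p "$target" || return 1
    case "$archive" in
        *.zip)          unzip -q "$archive" -d "$target" ;;
        *.tar.gz|*.tgz) tar -xzf "$archive" -C "$target" ;;
        *.tar)          tar -xf "$archive" -C "$target" ;;
        *)              echo "unsupported archive: $archive" >&2; return 1 ;;
    esac
}

# Example call (replace the path with the file you actually downloaded):
#   extract_sanchay /path/to/downloaded-archive.zip
```

After extraction, the versioned Sanchay directory should appear inside the target directory.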

The description below uses the Syntactic Annotation Interface as an example application in Sanchay that you might want to use.

1. Start

On Linux, run the following from a terminal:

user@host:~$ cd sanchay/
user@host:~/sanchay$ cd Sanchay-03-05-11/
user@host:~/sanchay/Sanchay-03-05-11$ sh Sanchay.sh

On Windows, go to the folder created when the Sanchay zip (or tar) file was extracted. Find the file named Sanchay.bat (it might be displayed as just Sanchay with a gear wheel as the icon) and double-click it.

This brings up the Sanchay GUI.

2. Go to the Syntactic Annotation Interface

Look for the button labelled SA and click on it.

This brings up the Syntactic Annotation Interface.



3. Find the 'Open' button and click on it.

This brings up the file browser.

(The language and encoding can also be selected from here.)


4. Open the file to be annotated.

Click on the 'Browse' button and select the file. Press OK.

(a) If the file is raw text (just simple unannotated text, nothing sinister), you will be taken to an editor where you can edit the text so that each sentence is on its own line. You can also correct the segmentation, individual words, etc. If you do any editing here, right-click and select Save. Then close the text editor window.

(b) You will be asked whether you want to run the POS tagger. (For testing purposes, select No, as the trained model may not be present.)

(c) You will then be asked whether you want to run the Chunker. (For testing purposes, select No, as the trained model may not be present.)

The file will then open in the interface.

(Check the number of sentences to see whether all the data has loaded properly.)
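To have a number to compare against, you can count the sentences in the raw text file from a terminal before opening it. The sketch below assumes the file already has one sentence per line, as described in step (a), and that blank lines are not counted as sentences. That assumption is made here and is not confirmed by Sanchay itself. The function name count_sentences is illustrative.

```shell
# Count the non-blank lines of a text file, on the assumption that each
# non-blank line holds exactly one sentence. Lines that are empty or contain
# only whitespace have zero fields (NF == 0) in awk and are skipped.
count_sentences() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Example call (replace with the file you are about to annotate):
#   count_sentences /path/to/raw-text-file
```

If the count differs from the number of sentences shown in the interface, the file may not have loaded completely, or the one-sentence-per-line assumption may not hold for it.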

5. Start annotation

Come on! Give it a try.

