
Aspell and Hunspell

September 17, 2009

Aspell and Hunspell are spell checkers.

The Hunspell Tamil dictionary has only 13000 words, so we are trying to add 100000 words to both the Aspell and Hunspell Tamil dictionaries.

The Tamil words are taken from the Lexicon Tamil Dictionary.

Working with Aspell:

Get the aspell-ta source from svn:

Link: http://packages.debian.org/source/sid/aspell-ta

The source can be checked out through svn from the following repository:

svn://svn.debian.org/debian-in/aspell-ta/trunk

Create a new folder and check out the source:

$ mkdir aspell

$ cd aspell

$ svn co svn://svn.debian.org/debian-in/aspell-ta/trunk

$ cd trunk

$ ls

configure  Copyright  doc   Makefile.pre  ta.cwl  tamil.alias  u-taml.cmap
COPYING    debian     info  README        ta.dat  ta.multi     u-taml.cset

$ preunzip ta.cwl

$ ls

configure  Copyright  doc   Makefile.pre  ta.dat       ta.multi  u-taml.cmap
COPYING    debian     info  README        tamil.alias  ta.wl     u-taml.cset

$ gedit ta.wl

Add the Tamil words and save the file.

$ prezip ta.wl

$ ./configure

$ make

$ ls                                                                 //The rws file is created now

configure  debian  Makefile      ta.cwl     tamil.alias  ta.wl~
COPYING    doc     Makefile.pre  ta.cwl.bk  ta.multi     u-taml.cmap
Copyright  info    README        ta.dat     ta.rws       u-taml.cset
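
Instead of editing ta.wl by hand in gedit, the new words can also be merged in from a file before running prezip. The following is only a sketch: merge_wordlist is a hypothetical helper, new-words.txt is an assumed file with one word per line, and sorting and de-duplicating the list is an assumption about what a tidy word list looks like, not something the aspell-ta package is known to require.

```shell
# Hypothetical helper: merge extra words (one per line) into an
# uncompressed aspell word list, dropping blank lines and duplicates.
merge_wordlist() {
    base="$1"     # e.g. ta.wl, produced by "preunzip ta.cwl"
    extra="$2"    # assumed file of new words, one per line
    tmp="$(mktemp)" || return 1
    # The extra "echo" calls guard against files without a final newline;
    # the blank lines they add are removed again by grep.
    { cat "$base"; echo; cat "$extra"; echo; } \
        | tr -d '\r' \
        | grep -v '^[[:space:]]*$' \
        | LC_ALL=C sort -u > "$tmp"
    # Write back in place so the permissions of the original file are kept.
    cat "$tmp" > "$base"
    rm -f "$tmp"
}

# Usage (run between "preunzip ta.cwl" and "prezip ta.wl"):
#   merge_wordlist ta.wl new-words.txt
```

With 100000 words to add, a merge like this is less error-prone than pasting into an editor, and running it twice does not create duplicate entries.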

Next, we add the ta.rws file to the local aspell-ta dictionary and check it on our system.

Get the aspell and aspell-ta source packages on Fedora:

# yumdownloader --source aspell
# yumdownloader --source aspell-ta

By default, the aspell dictionaries are located in /usr/lib/aspell-0.60/. This folder contains ta.rws, which holds the original aspell word list. Before moving the new ta.rws file here, back up the existing one:

# cp ta.rws ta.rws.bk

Then move the new ta.rws file here. Now all the words are added to the aspell dictionary.
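
The backup and copy steps above can be wrapped in a small function so that the original ta.rws is never lost. This is a minimal sketch: install_rws is a hypothetical name, and the default directory is simply the /usr/lib/aspell-0.60 path mentioned above, which may differ on other systems.

```shell
# Hypothetical helper: back up the installed ta.rws once, then
# copy the newly built ta.rws over it.
install_rws() {
    new_rws="$1"                            # freshly built ta.rws
    dict_dir="${2:-/usr/lib/aspell-0.60}"   # assumed dictionary directory
    target="$dict_dir/ta.rws"
    # Keep the first backup: never overwrite an existing ta.rws.bk,
    # so the original word list survives repeated installs.
    if [ -f "$target" ] && [ ! -e "$target.bk" ]; then
        cp -p "$target" "$target.bk" || return 1
    fi
    cp "$new_rws" "$target"
}

# Usage (as root, from the aspell-ta trunk folder):
#   install_rws ta.rws
```

To go back to the original dictionary, copy ta.rws.bk over ta.rws in the same folder.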

To Check:

# aspell -d ta.rws -a
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.6)
மாலதி                                        //This word is in the list
*
சுஜி                                            //This word is not in the list, so aspell suggests other options

& சுஜி 10 0: சோஜி, சுகி, சுசி, சுதி, சுனி, சுரி, சுளி, சுழி, சுவி, சப்ஜி

அஃகம்                               //This word is in the list
*
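
In the output above, a line starting with * means the word was found, and a line starting with & means it was not found and suggestions follow. Ispell-style output also uses # for a word that was not found and has no suggestions. To check many words at once, that output can be filtered down to only the unknown words. This is a sketch: misspelled_words is a hypothetical helper name and words.txt is an assumed input file with one word per line.

```shell
# Hypothetical helper: read "aspell -a" output on stdin and print only
# the words that were not found in the dictionary.
#   "& word count offset: suggestions"  -> not found, with suggestions
#   "# word offset"                     -> not found, no suggestions
misspelled_words() {
    awk '$1 == "&" || $1 == "#" { print $2 }'
}

# Usage:
#   aspell -d ta.rws -a < words.txt | misspelled_words
```

An empty result means every word in the input was found in the dictionary.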

NOTE:

The individual word lists have an extension of “.cwl” and are compressed to save space. To uncompress a word list use “preunzip BASE.cwl”, which will uncompress it and rename the file to “BASE.wl”. To dump a compressed word list to standard output use “precat BASE.cwl”. To uncompress all word lists in the current directory use “preunzip *.cwl”. For more help on “preunzip” use “preunzip --help”.

Hunspell:

Download wordxtr:

Link: https://fedorahosted.org/wordxtr/       //get the tar file from this link, or install it with yum:

# yum install wordxtr

# wordxtr ta_IN TAMIL                                                       //ref: Note 1

Creating dictionary for language “ta_IN” using text data in directory “TAMIL”
00%….Creating Text Data to Parse
25%….Reading Text Data to Parse
50%….Created Text Data to Parse
65%….Extracted words from input Text Data
80%….Removed duplicated words from extracted wordlist
Basic ta_IN.dic and ta_IN.aff created
……in current directory

NOTE 1:

TAMIL is a plain folder that contains only text files of Tamil words. ta_IN is the language code for Tamil.

The ta_IN.dic and ta_IN.aff files are created in the current directory. Move them to the Hunspell location, /usr/share/myspell/. Now all the words are added to the Hunspell dictionary.
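
Before moving the files, it can be worth a quick sanity check of the generated ta_IN.dic. In the Hunspell dictionary format, the first line of a .dic file holds the approximate word count and the words follow, one per line. The sketch below compares that header with the number of entries. dic_header_matches is a hypothetical helper, and whether wordxtr writes an exact count is an assumption. Hunspell itself treats the number only as a hint, so a mismatch is a warning sign rather than an error.

```shell
# Hypothetical helper: compare the word count declared on the first
# line of a Hunspell .dic file with the number of entries after it.
dic_header_matches() {
    dic="$1"
    # First line: declared (approximate) word count.
    declared="$(head -n 1 "$dic" | tr -d '[:space:]')"
    # Remaining non-blank lines: one dictionary entry each.
    actual="$(tail -n +2 "$dic" | grep -c -v '^[[:space:]]*$')"
    echo "declared=$declared actual=$actual"
    [ "$declared" = "$actual" ]
}

# Usage:
#   dic_header_matches ta_IN.dic || echo "header and entry count differ"
```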
