Adding Ispell support to mnoGoSearch
====================================

Introduction
============

When mnoGoSearch is used with ispell support, all words are normalized by
both the indexer and the search frontend. This makes it possible to find the
same word with different endings. For example, if the words "testing" or
"tests" are found in a document, the indexer stores the word "test" instead.
The search frontend will also look for the word "test" if "testing" or
"tests" is given in the search query. Note that this scheme rules out
exact-form search, but it usually reduces the database size and makes
searching faster.

Note that if you add ispell support to an already existing database,
reindexing is required. Otherwise non-normalized words will not be found
at all.

Two types of ispell files
=========================

mnoGoSearch understands two types of ispell files: affixes and dictionaries.
An ispell affix file contains rules for words and has approximately the
following format:

        flag V:
                E       > -E,IVE        # As in create > creative
                [^E]    > IVE           # As in prevent > preventive
        flag *N:
                E       > -E,ION        # As in create > creation
                Y       > -Y,ICATION    # As in multiply > multiplication
                [^EY]   > EN            # As in fall > fallen

An ispell dictionary file contains the words themselves and has the
following format:

        wop/S
        word/DGJMS
        wordage/S
        wordbook
        wordily
        wordless/P

Ispell prefixes support
=======================

Only suffixes are supported by default. Prefixes usually change the meaning
of a word: for example, somebody searching for the word "tested" probably
does not want "untested" to be found. However, you may activate prefix
support with the "IspellUsePrefixes yes" command in both indexer.conf and
search.htm. Prefix support may also be useful when checking a site's
spelling.

Three ispell modes
==================

mnoGoSearch version 3.0 or higher can store ispell files in an SQL database
(as in 2.x versions), load them from disk, or obtain them from a server.
To choose the ispell mode, use "IspellMode text", "IspellMode db" or
"IspellMode server". Note that "db" mode works only with a supported SQL
database and does not work with the built-in database. "server" ispell mode
is the fastest. "text" mode is faster than "db" at indexing time, while "db"
is faster than "text" at search time. You may configure the indexer to use
"text" mode and the search frontend to use "db" mode at the same time
(after having properly imported the same ispell files).

Using text ispell mode
======================

To make mnoGoSearch support text ispell mode, you must specify the Affix and
Spell commands in both the indexer.conf and search.htm files. The format of
the commands is:

        Affix <lang> <file name>
        Spell <lang> <file name>

The first parameter of both commands is a two-letter language abbreviation.
The second one is a file name. File names are relative to the mnoGoSearch
/etc directory. Absolute paths can also be specified.

Note that several languages can be loaded simultaneously. For example,

        Affix en en.aff
        Spell en en.dict
        Affix de de.aff
        Spell de de.dict

will load ispell support for both English and German.

Using db ispell mode
====================

You can import ispell data into the SQL database using the "indexer"
program. After that, indexer, search.cgi and the PHP frontend can be
switched to use SQL to normalize words by specifying "IspellMode db" in
search.htm and indexer.conf. "IspellMode db" gives faster results at search
time.

To import ispell files, run indexer with the following arguments:

        $ indexer -L lang -A affix.file   (to load affixes)
        $ indexer -L lang -D dict.file    (to load dictionary)

For example, these commands will import the English affixes and dictionary:

        $ indexer -L en -A en.aff
        $ indexer -L en -D en.dict

Note that the ispell files supplied with ispell packages for various
languages may have different extensions, not only *.aff and *.dict. Use the
format descriptions above to recognize the file types.
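
When a package ships files with unfamiliar extensions, the format
descriptions above can be turned into a rough check. The following Python
sketch is only a heuristic and is not part of mnoGoSearch; the function
name guess_ispell_file_type is hypothetical. It assumes that affix files
consist mostly of "flag X:" headers and "condition > replacement" rules,
and that dictionary files consist mostly of "word" or "word/FLAGS" lines.

```python
import re

def guess_ispell_file_type(lines):
    """Guess 'affix', 'dictionary' or 'unknown' from an iterable of lines."""
    affix_hits = 0
    dict_hits = 0
    for raw in lines:
        # Drop trailing "# ..." comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if re.match(r"flag\s+\*?\S+\s*:", line, re.IGNORECASE) or ">" in line:
            affix_hits += 1      # "flag V:" header or "E > -E,IVE" rule
        elif re.fullmatch(r"[^\s/]+(/\S+)?", line):
            dict_hits += 1       # "wordbook" or "word/DGJMS"
    if affix_hits > dict_hits:
        return "affix"
    if dict_hits > affix_hits:
        return "dictionary"
    return "unknown"

if __name__ == "__main__":
    sample = ["wop/S", "word/DGJMS", "wordbook"]
    print(guess_ispell_file_type(sample))
```

To check a real file, pass an open file object, for example
guess_ispell_file_type(open("en.aff", errors="replace")). The result is
only a hint; inspect the file by eye before importing it.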

Using spelld server
===================

The spelld server reads spell data from a separate configuration file
(/usr/local/mnogosearch/etc/spelld.conf by default), sorts it and stores it
in memory. The server communicates with clients in two ways. To the indexer
it transfers all the data, so that the indexer starts faster. From
search.cgi it receives a word to normalize and passes back to the client
(search.cgi) the list of normalized word forms. This allows faster
processing of search queries, because loading and sorting all the spell
data is avoided.

spelld.conf has the same format as the main configuration file, but the
server itself requires only the spell-data information for operation.

In the main configuration file (as well as in the search template) you must
indicate the spell server mode of operation and the DNS name of the machine
on which spelld runs:

        ispellmode server spellserver.host.dm

Spelld uses port 7001, but you can change this value if necessary. See the
UDM_SPELL_PORT definition in udm_services.h.

Checking site against correct spelling
======================================

You may change the word weight factors depending on whether a word is found
in the ispell dictionaries or not. Two indexer.conf commands are available
(both with a default value of 1):

        IspellCorrectFactor 1
        IspellIncorrectFactor 1

Setting "IspellCorrectFactor" to 0 prevents the indexer from storing
correctly spelled words in the database. Only incorrect words will be stored
in the database in this case. If no ispell files are used, all words are
considered "incorrect".

After editing indexer.conf, run indexer as usual. It is better to use the
"single" storage mode, which is the most suitable for spell checking. You
can then easily find the incorrect words and the URLs where they occur,
either with the SQL query "SELECT * FROM dict" (for the SQL version) or by
looking into the $PREFIX/var/dict.txt file (for the built-in database).
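
The effect of the two factors described in this section can be pictured
with a small conceptual model in Python. This is not mnoGoSearch code: the
function name words_to_store is hypothetical, and the model assumes that a
factor of 0 drops a word for either command, which the text above states
only for "IspellCorrectFactor".

```python
def words_to_store(words, dictionary, correct_factor=1, incorrect_factor=1):
    """Map each word that would be kept to the factor applied to it."""
    kept = {}
    for word in words:
        # Words found in the ispell dictionary count as correctly spelled.
        factor = correct_factor if word in dictionary else incorrect_factor
        if factor != 0:          # assumed: a factor of 0 means "not stored"
            kept[word] = factor
    return kept

if __name__ == "__main__":
    page = ["search", "engine", "serach"]
    known = {"search", "engine"}
    # With IspellCorrectFactor 0 only the misspelled word remains.
    print(words_to_store(page, known, correct_factor=0))
    # Without ispell files every word counts as "incorrect" and is kept.
    print(words_to_store(page, set(), correct_factor=0))
```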

Adding your own words into dictionary
=====================================

Your site may contain some rare words that are not in the ispell
dictionaries. You may list such words in a plain text file of this format
(one word per line):

        rare.dict:
        ----------
        webmaster
        intranet
        .......
        www
        http
        ---------

You may also use ispell flags in this file if you know how to :-)
This saves you from listing the same word with different endings in the
rare words file, for example "webmaster" and "webmasters". You may pick a
word from an existing ispell dictionary that follows the same changing
rules and simply copy its flags. For example, the English dictionary has
this line:

        postmaster/MS

So webmaster with the MS flags will probably be OK:

        webmaster/MS

Then copy this file to the /etc directory of mnoGoSearch and add it with the
Spell command, for example:

        Spell en rare.dict

or import it for use in "db" ispell mode:

        $ indexer -L en -D rare.dict

During the next reindexing with the "indexer -am" command, the new words
will be considered correctly spelled. Only truly incorrect words will
remain.
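
To make the role of flags more concrete, the following Python sketch shows
how a "word/FLAGS" entry could expand into word forms. It is an
illustration only and not how mnoGoSearch is implemented; the names RULES
and expand are hypothetical. The rules for flag V are taken from the affix
example earlier in this document. The single rule for flag S is a
simplified assumption made for this example, since real affix files define
more conditions.

```python
import re

# Each rule: (condition on the word ending, suffix to strip, suffix to add).
RULES = {
    # From the affix example in this document: "E > -E,IVE" and "[^E] > IVE".
    "V": [("E", "E", "IVE"), ("[^E]", "", "IVE")],
    # Assumed, simplified plural rule for illustration only.
    "S": [("[^SXZHY]", "", "S")],
}

def expand(entry):
    """Return the set of lower-case word forms for a 'word/FLAGS' entry."""
    word, _, flags = entry.partition("/")
    upper = word.upper()                 # ispell rules are written in capitals
    forms = {word.lower()}
    for flag in flags:
        for condition, strip, add in RULES.get(flag, []):
            if re.search(condition + "$", upper):
                stem = upper[: len(upper) - len(strip)]
                forms.add((stem + add).lower())
    return forms

if __name__ == "__main__":
    print(sorted(expand("webmaster/S")))
    print(sorted(expand("create/V")))
```

With these rules a single "webmaster/S" line covers both "webmaster" and
"webmasters", which is why copying flags from a similar word saves
listing every ending by hand.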