rex/porter

mirror of https://github.com/iourinski/porter synced 2026-01-02 08:24:37 +00:00

No description

Find a file

dmitri.iourinski 891db099c1 small changes, single-language setup		2017-01-20 16:24:46 +03:00
includes	fixed macro issues with pre-installed package	2016-12-29 00:46:15 -05:00
rules	fixed macro issues with pre-installed package	2016-12-29 00:46:15 -05:00
templates	template generator, new readme	2016-12-28 22:02:05 -05:00
tests	fixed macro issues with pre-installed package	2016-12-29 00:46:15 -05:00
wordlists	small bug fixes	2016-12-29 14:41:41 -05:00
LICENSE	missed license	2016-12-26 17:31:05 -05:00
porter.nim	small changes, single-language setup	2017-01-20 16:24:46 +03:00
porter.nimble	First commit	2016-12-26 17:26:58 -05:00
porter_dispatch.nim	small changes, single-language setup	2017-01-20 16:24:46 +03:00
README.md	template generator, new readme	2016-12-28 22:02:05 -05:00
template_gen.nim	fixed macro issues with pre-installed package	2016-12-29 00:46:15 -05:00

README.md

porter

Porter stemmer for nim language (with a hopefully easy option to add new languages).

This is a simple implementation of the Porter stemmer algorithm for nim language, using nim's regex implementation.

The algorithm and corresponding terminology can be found at http://snowballstem.org/. There are minor changes made for Russian language implementation (our algorithm is more "greedy" and chops off more suffixes than the original).

Usage:

import porter
...
let stemmer = newStemmer()
# to demonctrate available languages
for language in stemmer.getLanguages():
  echo language
# to stem some english text
let text: string = "Some string in English"
var stems: seq[string] = stemmer.stem(text, "EN")

where 'text' is a string and lang is a language code ("RU", "EN" etc). If an unknown language code is passed tokenized original text is returned, otherwise a sequence of strings corresponding to stemmed words is returned. There is a possibility to add stoplists so words like auxillary verbs, pronouns, and articles are not included into the returned sequence of stems. This is a difference between this stemmer and the classical implementation which returns everything from a stoplist without doing anything to it.

Nim's standard implementation of regular expressions (package re) has certain specifics, which were taken into consideration:

there is no match captioning into a named variable, so posix expressions like s/(something)(something else)/$1/ should be done in two steps: match $1 and $2 and then (if condition holds) replace $2 etc.
ranges for non-ansi characters may not work as expected, so an easy workaround is to use | grouping instead.

Adding new languages

The package is meant to be more or less universal, so there is a small parser for a pseudo-grammar. Adding a new language to the stemmer amounts to running a script template_gen.nim with new language code as the only parameter. The script will generate several template files:

rules/xx_rules.nim
wordlists/xx_stopwords.nim
wordlists/xx_test.nim

where xx is a language code, for instance to add templates for Klingon language run nim c -r template_gen.nim KL.

After the templates are generated edit them according to the comments in code and compile porter.nim. Generated templates are not added to git, so if you want to add them to the repository you need to add them to git manually.

In order for the language stemmer to work properly all the variables added in templates must be declared (they can be left with default values though).

Running nim c -r porter.nim runs tests on all available languages.