16 November 2010

TinySegmenter for C#

TinySegmenter is a text tokenizer for Japanese (and English). I ported it to C#; you can download the port from http://asheldritch.googlecode.com/files/TinySegmenter.cs

A tokenizer splits a body of text into individual words, which is useful for keyword searching, for example. This matters for Japanese in particular, because words are not separated by spaces. Other implementations are:
  • Perl: http://search.cpan.org/~jiro/Text-TinySegmenter-0.01/lib/Text/TinySegmenter.pm
  • JavaScript: http://chasen.org/~taku/software/TinySegmenter/
  • Python: http://lilyx.net/pages/tinysegmenterp.html
  • Objective-C: http://blog.bornneet.com/Entry/276/
  • Lisp: http://miyamuko.s56.xrea.com/xyzzy/tiny-segmenter.html
  • Ruby: http://d.hatena.ne.jp/llamerada/20080224/1203818061
  • VBA: http://pub.ne.jp/arihagne/?cat_id=123314
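
To make the idea of tokenizing concrete, here is a minimal sketch in C#. It is not TinySegmenter's algorithm and not the API of TinySegmenter.cs: TinySegmenter relies on a trained statistical model, whereas this sketch simply starts a new token whenever the script changes (hiragana, katakana, kanji, alphanumeric, other). The class name NaiveSegmenter and its Segment method are hypothetical names chosen for illustration, and the Unicode ranges cover only the basic blocks.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration only: splits text wherever the character class changes.
public static class NaiveSegmenter
{
    private enum CharClass { Hiragana, Katakana, Kanji, Alnum, Space, Other }

    // Classify a character by its basic Unicode block (an assumption: rarer
    // blocks such as CJK extensions are not handled and fall through).
    private static CharClass Classify(char c)
    {
        if (c >= '\u3040' && c <= '\u309F') return CharClass.Hiragana;
        if (c >= '\u30A0' && c <= '\u30FF') return CharClass.Katakana;
        if (c >= '\u4E00' && c <= '\u9FFF') return CharClass.Kanji;
        if (char.IsWhiteSpace(c)) return CharClass.Space;
        if (char.IsLetterOrDigit(c)) return CharClass.Alnum;
        return CharClass.Other;
    }

    public static List<string> Segment(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        var previous = CharClass.Space;

        foreach (char c in text)
        {
            CharClass cls = Classify(c);

            // A change of class closes the token collected so far.
            if (cls != previous && current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }

            // Whitespace separates tokens but is never part of one.
            if (cls != CharClass.Space) current.Append(c);
            previous = cls;
        }

        if (current.Length > 0) tokens.Add(current.ToString());
        return tokens;
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        // 私 | の | 名前 | は | 中野 | です
        Console.WriteLine(string.Join(" | ", Segment("私の名前は中野です")));

        // TinySegmenter | for | C | #
        Console.WriteLine(string.Join(" | ", Segment("TinySegmenter for C#")));
    }
}
```

The first sample happens to split cleanly because every word boundary coincides with a change of script. A word that mixes kanji and hiragana, or two adjacent hiragana words, would be split incorrectly by this approach, which is why a model-based tokenizer is the better choice for real text.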