12-25-2010, 10:00 AM
Joost van Baal
Bug#607964: ITP: ucto -- Unicode Tokenizer

Package: wnpp
Severity: wishlist
Owner: Joost van Baal <joostvb-debian-bugs-20101225-3@mdcc.cx>

* Package name    : ucto
  Upstream Author : ILK Research Group, Tilburg University, http://ilk.uvt.nl
* URL             : http://ilk.uvt.nl/mbt/
* License         : GPL-3
  Programming Lang: C++
  Description     : Unicode Tokenizer

Ucto can tokenize UTF-8 encoded text files (i.e. separate words from
punctuation, split sentences, generate n-grams), and offers several other
basic preprocessing steps (change case, count words/characters and reverse
lines) that make your text suited for further processing such as indexing,
part-of-speech tagging, or machine translation.
Ucto is a product of the ILK Research Group, Tilburg University (The
Netherlands).
If you are interested in machine parsing of UTF-8 encoded text files, e.g. to
do scientific research in natural language processing, ucto will likely be of
use to you.
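
To make the description above concrete, here is a minimal Python sketch of the three tokenization steps it names: separating words from punctuation, splitting sentences, and generating n-grams. It is only an illustration and not ucto's algorithm, rule set, or interface; the names tokenize, split_sentences and ngrams are hypothetical, and the regular expression is a deliberately naive assumption (for instance, it splits abbreviations such as "e.g." into separate tokens and does not handle decomposed Unicode sequences).

```python
import re

# Hypothetical helpers for illustration only; none of these names come from ucto.
# A token is either a run of word characters or one non-space punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Separate words from punctuation in a Unicode string."""
    return TOKEN_RE.findall(text)


def split_sentences(tokens, enders=(".", "!", "?")):
    """Group tokens into sentences, closing one at each sentence-ending mark."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in enders:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a closing mark
        sentences.append(current)
    return sentences


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    toks = tokenize("Hallo, wereld! Dit is ucto.")
    for sentence in split_sentences(toks):
        print(sentence, ngrams(sentence, 2))
```

A real tokenizer needs language-specific rules for abbreviations, clitics, URLs and the like, which is exactly the part a sketch like this leaves out.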


Upstream has not yet officially released ucto; currently there's just an
obsolete prerelease snapshot and some promising code in SVN (not git). See
also https://github.com/proycon , http://proylt.anaproy.nl/en/software/ and
http://proylt.anaproy.nl/media/software/ .

The frog package (see Bug#605905: ITP: frog -- tagger and parser for Dutch
language) will depend upon ucto. Frog will be the new name and reincarnation
of tadpole; see http://ilk.uvt.nl/tadpole/ .



irc:joostvb@{OFTC,freenode} ∙ http://mdcc.cx/ ∙ http://ad1810.com/
