Linux Archive

Samuel Thibault 02-11-2011 08:42 AM

Spell checker as reasonable SPAM prevention tool
 
Andreas Tille wrote on Fri 11 Feb 2011 10:19:07 +0100:
> PS: I assume that a spell checker can be configured so that it can
> distinguish between an English text with some or several mistakes
> and a text with, say, a 50% error rate, which is probably not
> understandable anyway.

Mmm, I think we've already had users with an error rate of even 50%,
simply because they mispell things. Yes, not everybody has even a basic
level of English, but they can still provide useful input to a
mailing list.

Samuel


--
To UNSUBSCRIBE, email to debian-devel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/20110211094249.GA5817@const.bordeaux.inria.fr

Cyril Brulebois 02-11-2011 08:49 AM

Samuel Thibault <sthibault@debian.org> (11/02/2011):
> Mmm, I think we've already had users with an error rate of even 50%,
> simply because they mispell things.

I like the intended pun!

KiBi.

Michelle Konzack 02-11-2011 09:23 AM

Hello Samuel Thibault,

On 2011-02-11 10:42:49, you hacked out the following:
> Andreas Tille wrote on Fri 11 Feb 2011 10:19:07 +0100:
> > PS: I assume that a spell checker can be configured so that it can
> > distinguish between an English text with some or several mistakes
> > and a text with, say, a 50% error rate, which is probably not
> > understandable anyway.
> Mmm, I think we've already had users with an error rate of even 50%,
> simply because they mispell things. Yes, not everybody has even a basic
> level of English, but they can still provide useful input to a
> mailing list.

In the roughly 600 Latvian spam messages I have received over the last
3 weeks, there are enough keywords to identify the mails as spam, yet,
for reasons I do not know, SpamAssassin gave the messages a score of -4
and greater.

Thanks, Greetings and nice Day/Evening
Michelle Konzack

--
##################### Debian GNU/Linux Consultant ######################
Development of Intranet and Embedded Systems with Debian GNU/Linux

itsystems@tdnet France EURL itsystems@tdnet UG (limited liability)
Owner Michelle Konzack Owner Michelle Konzack

Apt. 917 (homeoffice)
50, rue de Soultz Kinzigstraße 17
67100 Strasbourg/France 77694 Kehl/Germany
Tel: +33-6-61925193 mobil Tel: +49-177-9351947 mobil
Tel: +33-9-52705884 fix

<http://www.itsystems.tamay-dogan.net/> <http://www.flexray4linux.org/>
<http://www.debian.tamay-dogan.net/> <http://www.can4linux.org/>

Jabber linux4michelle@jabber.ccc.de
ICQ #328449886

Linux-User #280138 with the Linux Counter, http://counter.li.org/

Andreas Tille 02-11-2011 09:44 AM

On Fri, Feb 11, 2011 at 10:42:49AM +0100, Samuel Thibault wrote:
> Andreas Tille wrote on Fri 11 Feb 2011 10:19:07 +0100:
> > PS: I assume that a spell checker can be configured so that it can
> > distinguish between an English text with some or several mistakes
> > and a text with, say, a 50% error rate, which is probably not
> > understandable anyway.
>
> Mmm, I think we've already had users with an error rate of even 50%,
> simply because they mispell things. Yes, not everybody has even a basic
> level of English, but they can still provide useful input to a
> mailing list.

What limit to put on the error rate might be a topic for further
investigation, but I'm quite positive that there are reasonable
algorithms to detect what language a text is in, or rather to detect
whether a text attempts to be written in a certain language (which is
probably easier than guessing the language). Whether it is worth doing
some stats on the mailing list archive about this depends on whether we
ultimately want this language detection method for a SPAM filter or not.

My guess is that you will find a ratio of misspelled words to total
number of words that is a clear sign of non-English text; then you have
an intermediate area, to which the postings you are worried about
belong; and then there are the postings which are obviously trying hard
to be written in English. I'd like to get rid of the clearly
non-English texts. I have the impression that we have been getting more
and more of these for some time, and I assume that Bayesian filters are
not (yet) trained well enough to detect these as SPAM. So we need to
find some other means.
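
The three zones described above could be sketched roughly as follows.
This is only an illustration under stated assumptions: KNOWN_WORDS is a
tiny hypothetical stand-in for a real dictionary, and the `low` and
`high` cut-offs are invented placeholders that would have to come from
statistics on the list archive.

```python
import re

# Hypothetical stand-in for a real English word list (assumption, not from the thread).
KNOWN_WORDS = {"the", "package", "does", "not", "build", "on", "my", "system"}


def misspelled_ratio(text, known_words):
    """Return the share of words in `text` that are not in `known_words`."""
    # Words are runs of letters (Unicode-aware); digits and punctuation are ignored.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for word in words if word not in known_words)
    return unknown / len(words)


def classify(text, known_words, low=0.3, high=0.7):
    """Sort a text into one of three zones; `low` and `high` are assumed values."""
    ratio = misspelled_ratio(text, known_words)
    if ratio >= high:
        return "probably not English"
    if ratio > low:
        return "uncertain"
    return "attempting English"
```

Only the first zone would be a candidate for filtering; the middle zone
is where the posters with many spelling mistakes would land, so it
should be left alone.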

Kind regards

Andreas.

--
http://fam-tille.de


--
Archive: http://lists.debian.org/20110211104413.GB2274@an3as.eu

"brian m. carlson" 02-11-2011 01:27 PM

On Fri, Feb 11, 2011 at 10:19:07AM +0100, Andreas Tille wrote:
> for some time we have been getting more and more SPAM which is easy
> for me to detect (and most probably automatically too): SPAM in
> languages I simply do not understand and which are definitely not
> English. Wouldn't it be a reasonable means for a SPAM filter to mark
> mails which blatantly fail a spell checker as potential SPAM, and to
> just apply this filter to all Debian lists? We have defined languages
> for each list, and the "one mail per month" where a user just writes
> in the wrong language by accident will probably not harm the project.

I've been thinking about this some as well for my personal domain.
Debian has tools that can determine the language of a document
(libtextcat and friends). Emails that are 70% or more composed of
languages that I have no hope of speaking or understanding (i.e.,
everything but English, Spanish, French, and Portuguese) would be
rejected. I chose 70% as the threshold because sometimes Debian lists
get mails from users in both English and another language (in hopes of
being understood) and I wouldn't want to penalize those users. I
haven't implemented this, but I might at some point.

Obviously, this would have to be adjusted per-list; we wouldn't want to
reject German-language emails to debian-user-german. I also think
language testing is better than spell checking for English because
honestly English has a lot of pretty irregular and bizarre spellings; I
say this as someone whose native language is English and who spells
fairly decently. A spell checker might catch more legitimate emails
than we'd like.
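
A rough sketch of the 70% rule with a per-list set of accepted
languages might look like this. It does not use libtextcat: the
stopword lists and `guess_language` are toy, hypothetical stand-ins for
a real language identifier, and every function name here is an
assumption made for illustration.

```python
import re

# Toy stopword lists; a real setup would call an actual language identifier instead.
STOPWORDS = {
    "en": {"the", "and", "is", "to", "of", "in", "that", "it", "not", "for"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ich", "zu", "mit", "ein"},
    "fr": {"le", "la", "les", "et", "est", "pas", "je", "un", "une", "pour"},
}


def guess_language(paragraph):
    """Return the language whose stopwords occur most often, or None if none occur."""
    words = re.findall(r"[^\W\d_]+", paragraph.lower())
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def foreign_share(mail_body, allowed):
    """Share of words sitting in paragraphs whose language is not in `allowed`."""
    total = foreign = 0
    for paragraph in re.split(r"\n\s*\n", mail_body):
        count = len(re.findall(r"[^\W\d_]+", paragraph))
        if count == 0:
            continue
        total += count
        # Paragraphs that cannot be identified (None) count as foreign here.
        if guess_language(paragraph) not in allowed:
            foreign += count
    return foreign / total if total else 0.0


def should_reject(mail_body, allowed, threshold=0.7):
    """Apply the 70% threshold from the text; `allowed` is chosen per list."""
    return foreign_share(mail_body, allowed) >= threshold
```

With `allowed={"en"}` a mail written half in English and half in German
stays below the threshold, while the same German text alone would be
rejected; passing `allowed={"de"}` for a German-language list accepts it.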

--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

The Fungi 02-11-2011 08:17 PM

On Fri, Feb 11, 2011 at 10:19:07AM +0100, Andreas Tille wrote:
[...]
> I assume that a spell checker can be configured so that it can
> distinguish between an English text with some or several mistakes
> and a text with, say, a 50% error rate, which is probably not
> understandable anyway.

But could it reliably pass MBF announcements which are 99% package
names and (often numerous non-English) maintainer names? Or a
message which is 80% C source code because it contains a patch under
discussion? Those definitely seem to me like important test cases,
at least, which I don't think most human-language-oriented
spell-checkers would deal with well (though I'd love to be proven
wrong!).
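
One possible way to approach the patch test case, which nobody in the
thread proposed and which is only a sketch, is to drop quoted text and
unified-diff lines before any spell or language check runs. The
function name `prose_lines` and the prefix heuristics are assumptions,
and this does nothing for MBF announcements made up of package and
maintainer names.

```python
def prose_lines(mail_body):
    """Return the lines that look like prose, skipping quotes and unified diffs."""
    kept = []
    in_hunk = False
    for line in mail_body.splitlines():
        if line.startswith("@@"):
            # A hunk header starts a run of context, added and removed lines.
            in_hunk = True
            continue
        if in_hunk and line.startswith((" ", "+", "-")):
            continue
        # Any other line (including an empty one) ends the current hunk.
        in_hunk = False
        if line.startswith((">", "diff ", "index ", "--- ", "+++ ")):
            continue
        if line.strip():
            kept.append(line)
    return kept
```

A mail that is 80% patch would then be judged only on its few prose
lines. The heuristic is deliberately crude: an empty line inside a hunk
ends it, so later context lines of that hunk would slip through.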
--
{ IRL(Jeremy_Stanley); WWW(http://fungi.yuggoth.org/); PGP(43495829);
WHOIS(STANL3-ARIN); SMTP(fungi@yuggoth.org); FINGER(fungi@yuggoth.org);
MUD(kinrui@katarsis.mudpy.org:6669); IRC(fungi@irc.yuggoth.org#ccl);
ICQ(114362511); YAHOO(crawlingchaoslabs); AIM(dreadazathoth); }


--
Archive: http://lists.debian.org/20110211211650.GO9372@yuggoth.org

