Linux Archive

Wulfy 08-02-2008 03:52 AM

Make a word list from a text
 
I want to take a text file and extract all the words and sort them into
a unique list. I've looked at split, cut, sed and awk (the last two
just confused me no end... :@( ) and I can't find a simple way to do
it. I suppose I could write a Java program to do it, but it seems silly
to reinvent the wheel like that. I'm sure there are a bazillion ways to
do it on the command line but I'm flummoxed. I tried googling and every
search string I tried brought me dozens of Windows programs to do the
job or python programs, but nothing I could understand...

--
Blessings

Wulfmann

Wulf Credo:
Respect the elders. Teach the young. Co-operate with the pack.
Play when you can. Hunt when you must. Rest in between.
Share your affections. Voice your opinion. Leave your Mark.
Copyright July 17, 1988 by Del Goetz


--
kubuntu-users mailing list
kubuntu-users@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/kubuntu-users

"Mark A. Taff" 08-02-2008 05:00 AM

On Friday 01 August 2008 20:52:46 Wulfy wrote:
> I want to take a text file and extract all the words and sort them into
> a unique list.

How about:


perl -e '$data = `cat ./pgadmin.log`; @words = split(/ /, $data);
foreach $word (@words) { print "$word\n"; }' | sort | uniq

Replace ./pgadmin.log with your FILE.

HTH,

Mark
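
For comparison, here is a minimal sketch of the same split, sort, uniq idea using only standard shell tools instead of Perl. It is not from the thread: the function name word_list is made up, and it assumes a tr that pads the second set to the length of the first (GNU and BSD tr both do). Unlike split(/ /, ...), it also treats tabs and newlines as separators.

```shell
#!/bin/sh
# word_list: hypothetical helper. Reads text on stdin and prints each
# distinct whitespace-separated token once, in sorted order.
word_list() {
    # tr -s turns every run of whitespace into a single newline (one token per line).
    # grep -v '^$' drops the empty line that leading whitespace would leave.
    # sort -u sorts and removes duplicates in one step (equivalent to sort | uniq).
    tr -s '[:space:]' '\n' | grep -v '^$' | LC_ALL=C sort -u
}

# "cat" and "cat." are listed separately: punctuation is still attached here.
printf 'the cat  sat\nthe cat.\n' | word_list
```

To run it on a real file, something like word_list < FILE should do. LC_ALL=C is only there to make the sort order predictable; drop it for locale-aware ordering.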


Brendan 08-02-2008 05:03 AM

On Friday 01 August 2008, Wulfy wrote:
> I want to take a text file and extract all the words and sort them into
> a unique list. I've looked at split, cut, sed and awk (the last two

This is not exactly correct, but this is a good start...
It's from memory, so it should only be slightly wrong.

#!/usr/bin/perl

my $filename = "foo.txt";
open( FILE, "< $filename" ) or die "Can't open $filename : $!";
my @words;
my @tmp;

while (<FILE>) {
    @tmp = split(/ /, $_);
    push @tmp, "\n";
    push @words, @tmp;

    # push @words, (split(/ /, $_));
    # push @words, "\n";
}

print @words;

exit;
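
The loop above collects the tokens but never sorts or de-duplicates them. One way to finish the idea is sketched below in shell with awk rather than Perl, so it is a swapped-in tool and not Brendan's code. The function name unique_words is invented for illustration, and "word" here just means anything between blanks.

```shell
#!/bin/sh
# unique_words: hypothetical helper. Reads text on stdin and prints each
# distinct whitespace-separated token once, sorted.
unique_words() {
    awk '
        # awk splits every input line into fields $1..$NF on runs of blanks.
        { for (i = 1; i <= NF; i++) seen[$i] = 1 }
        # Array keys are unique, so each token is printed exactly once.
        END { for (w in seen) print w }
    ' | LC_ALL=C sort
}

printf 'to be or not\nto be\n' | unique_words
```

The array plays the role that @words plays in the Perl script, except that storing tokens as keys removes duplicates for free. The order of "for (w in seen)" is unspecified, which is why the output still goes through sort.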


Wulfy 08-02-2008 05:52 AM

Many thanks to Mark and Brendan for their help.

Mark's program gave me a list of words, one to a line; I now need to
remove punctuation. Brendan's program removed all the spaces but
otherwise left the rest of the text as it was.


"Mark A. Taff" 08-02-2008 06:17 AM

> Mark's program gave me a list of words, one to a line, I now need to
> remove punctuation. Brendan's program removed all the spaces but
> otherwise left the rest of the text as it was,


perl -e '$data = `cat ./pgadmin.log`; $data =~ s/[?.,";:()\/\\_*!]//g;
@words = split(/ /, $data); foreach $word (@words) { print "$word\n"; }' |
sort | uniq

This version removes most punctuation, with the notable exception of
apostrophes. You start running into context problems: is that apostrophe
marking a plural (mark's computer), an omitted character (ma'am, don't), or a
quotation ("blah," said Mark)? The same applies to dashes and hyphenated words
(self-defense).

But, this will get you close.

HTH,

Mark
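
One possible way around the apostrophe and hyphen question is to match words instead of deleting punctuation. The sketch below is not from the thread. It assumes a grep that supports -o (GNU and BSD grep do; POSIX does not require it), and it assumes that a word is a run of letters optionally joined to further runs by a single apostrophe or hyphen. The function name words_keep_inner is made up.

```shell
#!/bin/sh
# words_keep_inner: hypothetical helper. Reads text on stdin and prints the
# sorted unique words, keeping apostrophes and hyphens that sit between letters.
words_keep_inner() {
    # -o prints every match on its own line; -E enables the grouped pattern.
    # Leading or trailing quotes and dashes never match, because the pattern
    # must start and end with a letter.
    grep -oE "[[:alpha:]]+(['-][[:alpha:]]+)*" | LC_ALL=C sort -u
}

printf '%s\n' "\"Don't,\" said Mark's self-defense coach." | words_keep_inner
```

With this definition "don't", "Mark's" and "self-defense" each survive as one word, while quoting marks around a word are dropped. Digits are not treated as part of a word, which may or may not be what you want.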


Wulfy 08-02-2008 07:02 AM

Wonderful! Thanks so much! :@)


Brendan 08-02-2008 06:50 PM

On Saturday 02 August 2008, Wulfy wrote:
> Many thanks to Mark and Brendan for their help.
>
> Mark's program gave me a list of words, one to a line, I now need to
> remove punctuation. Brendan's program removed all the spaces but
> otherwise left the rest of the text as it was,

Yeah, the perl debugger in my head doesn't work so well. ;-)


"John DeCarlo" 08-02-2008 10:06 PM

On Sat, Aug 2, 2008 at 2:17 AM, Mark A. Taff <marktaff@comcast.net> wrote:
> This version will remove most punctuation, notably except apostrophes. You
> start running into context problems: Is that apostrophe marking a plural
> (mark's computer)

Oops, you mean "possessive", not plural.

--
John DeCarlo, My Views Are My Own



Wulfy 08-02-2008 10:35 PM

Brendan wrote:
> Yeah, the perl debugger in my head doesn't work so well. ;-)

Still, you took the time to try to help and that's what this community
is supposed to be about. I couldn't have even begun to write that perl
program so any help is appreciated! :@)


Donn 08-03-2008 03:17 AM

On Saturday, 02 August 2008 05:52:46 Wulfy wrote:
> I want to take a text file and extract all the words and sort them into
> a unique list.
I gave it a go and this is the best I can do:

cat myfile | sed "s/'//g" | tr -s '[:space:][:punct:]' "\n" | sort | uniq -c

The sed bit removes single quotes so that words like "didn't" don't
become "didn" and "t". tr then replaces each run of spaces or punctuation with
a newline, and the result goes out to sort and uniq -c, which prefixes each
word with the number of times it occurs.

I find text parsing very hard to do. There seem to be corner cases everywhere.
What is a word, really? How do you define its edges? Ah well, HTH.
d
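
Two corner cases the pipeline above leaves open are capitalisation ("The" and "the" are counted separately) and the order of the counts. A hedged sketch of one way to handle both follows. It is an extension, not Donn's command: the function name word_counts is made up, and it relies on the same tr padding behaviour the original pipeline already depends on.

```shell
#!/bin/sh
# word_counts: hypothetical helper. Reads text on stdin and prints
# "count word" lines, most frequent word first.
word_counts() {
    tr '[:upper:]' '[:lower:]' |            # fold case so "The" and "the" count together
        sed "s/'//g" |                      # drop apostrophes, as the sed step above does
        tr -s '[:space:][:punct:]' '\n' |   # turn each run of separators into one newline
        grep -v '^$' |                      # skip the empty line left by a leading separator
        LC_ALL=C sort | uniq -c |           # group identical words and count them
        LC_ALL=C sort -k1,1nr -k2,2         # highest count first, ties in byte order
}

printf '%s\n' "The cat. The dog didn't; the cat!" | word_counts
```

uniq -c pads the counts with leading spaces, so pipe the result through something like awk '{print $1, $2}' if you need tidy columns. Folding everything to lower case also flattens proper nouns, which is one more of those corner cases.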



All times are GMT.