09-29-2012, 10:14 PM

Using wget to fill in a form

> They've learned a lot about the structure of classification systems since
> LC was set up.

I've been doing some reading, and there is work under way to modernize the
classification system. In the meantime this works for my needs. I do appreciate
the suggestion.


Sent - Gtek Web Mail



--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/1348956870.8325375@webmail.gtek.biz
 
09-29-2012, 10:41 PM

In the end I did pretty much as suggested, using wget and re-using session IDs.
I created a bash script that gets a session ID, reads the list of ISBNs,
and then tries to retrieve their info. If a retrieval reports that the session
has expired, the script gets a new session ID and retries. It also does a decent
job of writing the retrieved records in CSV format for easy import into a
database or conversion to XML.

The script and my list of 25 test ISBNs are included below. Interestingly,
about five of them, or 20%, come up with no record found.
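
Before chasing the missing records, it may be worth ruling out typos in the list. ISBN-10 and ISBN-13 both end in a check digit, so a malformed number can be caught locally without querying LOC. The sketch below is only an illustration: `isbn_is_valid` is a name made up here and is not part of the script that follows. A valid check digit only shows that the number is well formed, not that LOC holds a record for it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): succeeds if $1 is an
# ISBN-10 or ISBN-13 whose check digit is consistent with the other digits.
isbn_is_valid() {
    rest=$1
    len=${#rest}
    sum=0
    pos=0
    case $len in
        10|13) ;;
        *) return 1 ;;
    esac
    while [ -n "$rest" ]; do
        char=${rest%"${rest#?}"}    # first remaining character
        rest=${rest#?}
        pos=$((pos + 1))
        case $char in
            [0-9]) value=$char ;;
            [Xx])
                # X stands for 10 and is only legal as the last ISBN-10 character
                [ "$len" -eq 10 ] && [ "$pos" -eq 10 ] || return 1
                value=10 ;;
            *) return 1 ;;
        esac
        if [ "$len" -eq 10 ]; then
            weight=$((11 - pos))                  # weights 10, 9, ..., 1
        else
            weight=$((1 + 2 * ((pos + 1) % 2)))   # weights 1, 3, 1, 3, ...
        fi
        sum=$((sum + weight * value))
    done
    if [ "$len" -eq 10 ]; then
        [ $((sum % 11)) -eq 0 ]    # ISBN-10: weighted sum divisible by 11
    else
        [ $((sum % 10)) -eq 0 ]    # ISBN-13: weighted sum divisible by 10
    fi
}
```

For example, `isbn_is_valid 1886411484 || echo "check this one"` stays silent for the Linux Cookbook ISBN and would flag a mistyped number.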

If I try to do anything fancier, I will learn how to query the MARC
system directly. The LOC site has a lot of information available.

I appreciate all of the help and suggestions I received.



#!/bin/bash

#*******************************************#
# getLOCinfo.sh #
# #
# A script to read a list of ISBN numbers #
# from an input file, and to retrieve the #
# LOC info for that item from the LOC web #
# search form. #
# #
# The input file is expected to contain #
# a single line of ISBN numbers separated #
# by whitespace. Alternatively, the file #
# can contain one ISBN per line as long as #
# all but the final line ends with white- #
# space followed by a backslash (actually #
# I think all lines can end that way). #
#*******************************************#

# Script Constants:
BASE_URL="http://www.loc.gov/cgi-bin/zgate"
E_BAD_ARGS=65
E_BAD_FILE=66
E_NO_SESSION_ID=67
NUM_ARGS=2
NUM_EXPIRED=10
SUCCESS=0

# Script variables:
expired_count=0
result="Your session has expired"
result_url=$BASE_URL
session_url=$BASE_URL

# A function to get a new sessionid:
GetSessionID ()
{
    session_url=$BASE_URL"?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/"
    session_url=$session_url"locils2.html,z3950.loc.gov,7090"
    # The session ID is the fourth double-quote-delimited field on the
    # line that mentions SESSION_ID:
    sessionid=$(wget "$session_url" -o /dev/null -O - |
        grep SESSION_ID |
        cut -d '"' -f4)
    if [ -z "$sessionid" ]
    then
        echo "Unable to get session ID. Exiting"
        exit $E_NO_SESSION_ID
    fi
}

# A function to "build" the request URL:
BuildURL ()
{
    url=$BASE_URL"?ACTION=SEARCH&DBNAME=VOYAGER&ESNAME=B&MAXRECORDS=20&"
    url=$url"RECSYNTAX=1.2.840.10003.5.10&REINIT=/cgi-bin/zgate?ACTION=INIT&"
    url=$url"FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,"
    url=$url"7090&srchtype=1,1016,2,102,3,3,4,2,5,100,6,1&SESSION_ID=$1&"
    url=$url"TERM_1=$2"
}

# Make sure file names were supplied when the script was called:
if [ $# -ne $NUM_ARGS ]
then
    echo "ERROR: Incorrect number of parameters supplied. Exiting..."
    exit $E_BAD_ARGS
fi

# Make sure the input file exists and is not empty:
if [ ! -f "$1" ] || [ ! -s "$1" ]
then
    echo "ERROR: $1 not found or is an empty file. Exiting..."
    exit $E_BAD_FILE
fi

# Truncate the output file if necessary:
if [ -s "$2" ]
then
    echo -n "Warning: $2 exists and is not empty. Continue [y/N]? "
    read input
    # Quote the substitution so that an empty answer counts as "no"
    # instead of breaking the test:
    if [ "$(echo "$input" | tr A-Z a-z)" != "y" ]
    then
        echo "Please provide a valid output file name"
        exit $E_BAD_FILE
    fi
    cat /dev/null > "$2"
fi

# Get a session ID:
GetSessionID

# Read the file contents (read without -r joins backslash-continued lines):
read isbn_list < "$1"

for isbn in $isbn_list
do
    BuildURL "$sessionid" "$isbn"
    result=$(wget "$url" -o /dev/null -O - | tr '\n' ' ')
    # Renew the session and retry while the server reports an expired session:
    while [ -n "$(echo $result | sed -n -e '/Your session has expired/Ip')" ] &&
          [ $expired_count -lt $NUM_EXPIRED ]
    do
        let "expired_count+=1"
        GetSessionID
        BuildURL "$sessionid" "$isbn"
        result=$(wget "$url" -o /dev/null -O - | tr '\n' ' ')
    done

    if [ $expired_count -eq $NUM_EXPIRED ]
    then
        echo "Unable to get session ID. Exiting"
        exit $E_NO_SESSION_ID
    else
        expired_count=0
    fi

    if [ -n "$(echo $result | sed -n -e '/No records matched your query/Ip')" ]
    then
        # Print the not found message to stderr:
        echo "$isbn: No record found" >&2
    else
        # Pull the record out of the <pre> block and rewrite it as
        # double-quoted, comma-separated fields:
        echo -n "\"$isbn\"," >> "$2"
        echo $result | sed -n -e 's/.*<pre>\(.*\)<\/pre>.*/\1/Ip' |
            sed -e 's/ \+/ /g' |
            sed -e 's/^Author: /"/' |
            sed -e 's/., [0-9]\{4\}-[0-9]\{0,4\} \(Title: \)/. \1/' |
            sed -e 's/. Title: /","/' |
            sed -e 's/. Published: /","/' |
            sed -e 's/, c\([0-9]\{4\}\). LC Call No.: /","\1","/' |
            sed -e 's/ *$/"/' >> "$2"
    fi
done

exit $SUCCESS

##### ISBN List: #################################################################

0805375651
0314027157
0201087987
9780980232714
0131774115
0789731274
1874416656
1886411484
9780425238981
0070726922
0495011622
1565927699
0673524841
0721659659
9781847991683
0596100795
0596001584
9780980455205
0835930513
9780954452971
0619121475
9780321553577
0130424110
0201612445
9780123705488
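
The list above is shown one ISBN per line without trailing backslashes, while the script's `read isbn_list < $1` only picks up the first line unless the lines are backslash-continued. A possible alternative, sketched here under the assumption that the file contains nothing but ISBNs, whitespace and optional backslashes, is to normalize the file first. `read_isbns` is a hypothetical name, not part of the script above.

```shell
#!/bin/sh
# Hypothetical helper: print every whitespace-separated token in the file
# named by $1, one per line, dropping backslashes and blank lines.
read_isbns() {
    tr -d '\\' < "$1" |          # remove any line-continuation backslashes
        tr -s '[:space:]' '\n' | # turn every run of whitespace into one newline
        sed '/^$/d'              # drop an empty first line, if any
}
```

The main loop could then be written as `for isbn in $(read_isbns "$1")`, which accepts either input layout.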


Archive: http://lists.debian.org/1348958474.34867622@webmail.gtek.biz
 
09-30-2012, 11:33 AM
Morten Bo Johansen

craig@gtek.biz <craig@gtek.biz> wrote:

> I have a small book collection (~150) that I thought would be neat to
> catalog by the Library of Congress catalog numbers. I have found a LOC
> search form that will allow me to input the ISBN, and it will return
> the information I want:

[..]

> I have the list of book ISBNs in a text file, so scripting this should
> be quite easy. The problem is I can't figure out how to submit the form
> from the command line. I figured wget would be the best way, but
> everything I try results in downloading a single line that reads "Your
> form didn't include an ACTION!" So I thought I would turn to here for
> help. The test ISBN I am using is for The Linux Cookbook: 1886411484,
> QA76.76.O63S788 2001.

There are several URLs on loc.gov that will retrieve book information
from an ISBN. The one below has no problem with session cookies. So
wouldn't this quick and dirty one-liner do what you want?


#!/bin/sh

# loc.sh <ISBN>

elinks -dump -dump-charset utf8 -no-references -no-numbering \
"http://www.loc.gov/cgi-bin/zclient?host=z3950.loc.gov&port=7090&attrset=BIB1&rtype=USMARC&DisplayRecordSyntax=HTML&ESN=F&startrec=1&maxrecords=10&dbname=Voyager&srchtype=1,7,2,3,3,1,4,1,5,1,6,1&term_term_1=$1"

So loc.sh 1886411484 will output the information for the Linux Cookbook
in plain text format.
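
To run that lookup over a whole list, a small wrapper loop should be enough. This is only a sketch: `lookup_all` is a made-up name, it assumes the input file holds one ISBN per line, and it assumes the script above has been saved as an executable `./loc.sh` in the current directory. Any other command can be passed in place of the default.

```shell
#!/bin/sh
# Hypothetical wrapper: run a lookup command once per ISBN listed in a file.
# Usage: lookup_all FILE [COMMAND [ARGS...]]
# COMMAND defaults to ./loc.sh (assumed to be the script shown above).
lookup_all() {
    list=$1
    shift
    [ $# -gt 0 ] || set -- ./loc.sh
    while read -r isbn; do
        [ -n "$isbn" ] || continue           # skip blank lines
        printf '===== %s =====\n' "$isbn"    # separator between records
        # Redirect stdin so the command cannot swallow the rest of the list:
        "$@" "$isbn" < /dev/null
    done < "$list"
}
```

Something like `lookup_all isbns.txt > records.txt` would then collect all of the text dumps in one file, with `isbns.txt` standing in for whatever the list file is called.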

> And a related side question. From my reading, I've learned that the
> Z39.50 protocol is used to query databases, usually library related. Is
> anyone aware of an ISBN database table that can be downloaded by the
> user, preferably in a format that can be imported into MySQL or
> PostgreSQL?

Probably, but I suppose the output is very standardized, so you can
easily convert it to CSV format or something similar.
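
As a rough illustration of the CSV side of that, the fiddly part is usually quoting: fields that contain commas or double quotes need to be wrapped in double quotes, with embedded quotes doubled. The helpers below are hypothetical (`csv_field` and `csv_row` are names invented for this sketch) and say nothing about how the fields are extracted from the LOC output in the first place.

```shell
#!/bin/sh
# Hypothetical helper: print one value as a double-quoted CSV field,
# doubling any double quotes it contains.
csv_field() {
    printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# Hypothetical helper: print all arguments as one CSV row.
csv_row() {
    sep=
    for field in "$@"; do
        printf '%s' "$sep"
        csv_field "$field"
        sep=,
    done
    printf '\n'
}
```

For example, `csv_row 1886411484 'The Linux Cookbook' 'QA76.76.O63S788 2001'` prints a single quoted row that most database import tools should accept.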


Regards,

Morten


Archive: http://lists.debian.org/slrnk6gbgu.7au.mbj@gatsby.mbjnet.dk
 
10-02-2012, 03:41 PM

> There are several urls on loc.gov that will retrieve book information
> from an ISBN. The one below has no problem with session cookies. So
> wouldn't this quick and dirty one-liner do what you want?
>
>
> #!/bin/sh
>
> # loc.sh <ISBN>
>
> elinks -dump -dump-charset utf8 -no-references -no-numbering \
> "http://www.loc.gov/cgi-bin/zclient?host=z3950.loc.gov&port=7090&attrset=BIB1&rtype=USMARC&DisplayRecordSyntax=HTML&ESN=F&startrec=1&maxrecords=10&dbname=Voyager&srchtype=1,7,2,3,3,1,4,1,5,1,6,1&term_term_1=$1"
>
> so loc.sh 1886411484 will output the information for the Linux Cookbook
> in a pure text format.
>

Well, that certainly looks a lot better than what I came up with. I will
have to give it a try, but I doubt I will have time before Friday to play
with this again. I'll let you know. Out of curiosity, can this be done
with lynx instead, since I have it installed? If not, I can always
install elinks.
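
On the lynx question: lynx also has a `-dump` mode, so something along these lines may be roughly equivalent. This is an untested sketch, not a confirmed replacement. `loc_query_url` and `loc_lynx` are names made up here, and the text layout lynx produces will probably differ from the elinks output.

```shell
#!/bin/sh
# Hypothetical helper: build the zclient URL quoted above for one ISBN.
loc_query_url() {
    printf '%s' \
        "http://www.loc.gov/cgi-bin/zclient?host=z3950.loc.gov&port=7090" \
        "&attrset=BIB1&rtype=USMARC&DisplayRecordSyntax=HTML&ESN=F" \
        "&startrec=1&maxrecords=10&dbname=Voyager" \
        "&srchtype=1,7,2,3,3,1,4,1,5,1,6,1&term_term_1=$1"
}

# Assumed lynx counterpart of the elinks call: -dump prints the rendered
# page to stdout and -nolist leaves out the trailing list of links.
loc_lynx() {
    lynx -dump -nolist "$(loc_query_url "$1")"
}
```

`loc_lynx 1886411484` would be the counterpart of `loc.sh 1886411484`.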

Thanks!


Archive: http://lists.debian.org/1349192494.403730645@webmail.gtek.biz
 
