03-02-2010, 01:53 AM
Mel Chua

FAS scraper

Since I was talking about this in #fedora-mktg as I made this, I thought
I'd share. Basically, Diana was talking about how it's hard for her to
figure out who's an active contributor (for her research) since there
are so many ways and means and places (git, wiki, lists, etc) to
contribute to Fedora, so I said "well, fire up twill, scrape 'em all
down, do some text processing, and you'll have a per-user portfolio you
can analyze to get an 'activity count.'"

After several hours of being too distracted to actually implement a
quick-and-dirty proof of concept, I sat down and spent (according to IRC
timestamps) 8 minutes actually looking up twill python API syntax and
writing 11 lines of code to do the job, then 29 minutes to comment it,
perhaps a little too exhaustively.

http://mchua.fedorapeople.org/FAS_scraper

When run, this will take a list of FAS usernames and spit out a series
of <username>.html files containing multiple-service "portfolios" for
that user (currently: wiki edits and packages maintained, but easily
extensible).

I've pasted the README below to give folks an idea of what this does.
It's a proof of concept looking for someone who can architect and
implement it better, as I don't really have the time to do it properly.

--- README.txt ---

# FAS_scraper.py
# v.1.0 (March 1, 2010)
# Mel Chua <mchua@fedoraproject.org>

# This is a quick proof-of-concept scraper inspired by Diana Martin's research
# on the Fedora community; she's trying to get a gauge on who in Fedora
# is an "active contributor," so I suggested making a tiny scraper to gather
# all the FAS-authenticated activity of a user from existing webpages.
# I'm pretty sure most of these services have APIs that would do the job
# better and less kludgily, but this is just to see if it's a useful thing.

== Caveat ==

This isn't actually a proper README.txt - rather, a quick hack taken
from the opening code comments. The python code itself is extensively
commented (there are 11 lines of actual code in the 46-line file).

== Installation ==

You will need python and twill installed to run this script. On Fedora:

yum install python python-twill

Then download FAS_scraper.py into a directory and run it:

python FAS_scraper.py

You'll see a lot of output (the html of the pages being scraped) being
dumped into your terminal; I'm leaving it verbose for now on purpose so
people can see what's going on.

You'll end up with a series of <username>.html files in the directory
that FAS_scraper.py is in. Each file contains the raw HTML dumps of that
FAS user's profile pages, one per specified service.
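For readers who want the shape of the thing without downloading the file, here is a minimal sketch of the loop described above. It is not the original FAS_scraper.py: it assumes Python 3, uses urllib from the standard library in place of twill, and the service URLs, SERVICES, fetch, build_portfolio and scrape are all made-up names for illustration.

```python
# Sketch only, not the original script. Service URLs are placeholders.
import os
from urllib.request import urlopen

# Hypothetical service table: service name -> start of the per-user URL.
SERVICES = {
    "wiki": "https://example.org/wiki/User:",
    "pkgdb": "https://example.org/pkgdb/users/",
}


def fetch(url):
    """Download one page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def build_portfolio(username, services=SERVICES, fetcher=fetch):
    """Concatenate the user's page from every service into one string."""
    sections = []
    for name, start_of_url in services.items():
        html = fetcher(start_of_url + username)
        sections.append("<!-- service: %s -->\n%s" % (name, html))
    return "\n".join(sections)


def scrape(usernames, out_dir=".", services=SERVICES, fetcher=fetch):
    """Write one <username>.html per user; return the paths written."""
    paths = []
    for username in usernames:
        path = os.path.join(out_dir, username + ".html")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(build_portfolio(username, services, fetcher))
        paths.append(path)
    return paths
```

Passing the fetcher in as an argument is just a convenience so the loop can be tried out offline with a fake function before pointing it at real pages.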

== Sample output ==

http://mchua.fedorapeople.org/FAS_scraper/sample_output

== Further developments ==

Some quick suggestions for further work - what actually needs to happen
is for this to be re-architected into a good general-purpose python
library for getting data from FAS-authenticated services.

* Instead of manually defining the list of FAS usernames in the code,
grab the list of usernames from the actual FAS system.

* Check for validity of FAS users you're looking for - right now, if you
enter a username that doesn't exist, the program will try to download
the pages for that user anyway. (It won't stop the program, you'll just
get output for that user consisting of webpages saying that the user
doesn't exist.)

* Add more services.

* Check for validity of services.

* Create a class for services so that we can handle cases that aren't
reachable by the format <start_of_url>/<username>. (For instance, what
if it's <start_of_url>/<username>/<end_of_url>?)

* Create a class for users that can parse and spit out statistics for
each of the services you're looking at. For instance, can you
automatically get the value of username.pkgdb.number_maintained()?
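
One possible shape for the service and user classes suggested above, as a rough sketch rather than a design. It assumes Python 3; Service, User, url_template, count_pattern, missing_marker and activity_count are hypothetical names, and the regex and the "user doesn't exist" marker text depend on what each real page looks like, which this sketch does not know.

```python
import re
from urllib.parse import quote


class Service:
    """One service, described by a URL template (hypothetical design)."""

    def __init__(self, name, url_template, count_pattern=None,
                 missing_marker=None):
        self.name = name
        # {username} may sit anywhere in the template, which covers both
        # <start_of_url>/<username> and <start_of_url>/<username>/<end_of_url>.
        self.url_template = url_template
        # Assumption: each regex match in the page is one unit of activity.
        self.count_pattern = count_pattern
        # Assumption: text the service shows for a nonexistent user.
        self.missing_marker = missing_marker

    def url_for(self, username):
        return self.url_template.format(username=quote(username, safe=""))

    def count(self, html):
        if self.count_pattern is None:
            raise ValueError("no count_pattern defined for %s" % self.name)
        return len(re.findall(self.count_pattern, html))

    def looks_missing(self, html):
        return self.missing_marker is not None and self.missing_marker in html


class User:
    """A FAS username plus the services to look it up on."""

    def __init__(self, username, services, fetcher):
        self.username = username
        self.services = {service.name: service for service in services}
        self.fetcher = fetcher  # any callable taking a URL, returning HTML

    def page(self, service_name):
        service = self.services[service_name]
        return self.fetcher(service.url_for(self.username))

    def exists(self, service_name):
        service = self.services[service_name]
        return not service.looks_missing(self.page(service_name))

    def activity_count(self, service_name):
        service = self.services[service_name]
        return service.count(self.page(service_name))
```

With something like this, the username.pkgdb.number_maintained() idea becomes roughly User("someone", [pkgdb], fetcher).activity_count("pkgdb"), where pkgdb is a Service whose count_pattern matches one package entry on the page.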
--
marketing mailing list
marketing@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/marketing
 
