Getting stats out of Wikiversity XML dumps

Someone was wondering how many users edit talk pages but not main namespace articles. So I wrote a couple of scripts. Nothing serious, but here they are in case someone finds this sort of thing fun. Linux only, no Windows, sorry... It's not efficient or nice either.

You go get the dump first, at http://download.wikimedia.org/backup-index.html and look for the first (most recent) entry with enwikiversity in it.

When you get to the subpage, you want the full archive, bz2 or 7z depending on what utilities you have lying around. These are the ones that say "All pages with complete edit history". Uncompress it and you're ready to get crunching.
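If you want to check the round trip before committing to the real multi-gigabyte file, here's the uncompress step exercised on a throwaway stand-in (the filename is just illustrative; if you grabbed the 7z flavor, `7z x` replaces `bzip2 -d`):

```shell
# Make a throwaway file standing in for the real dump, compress it,
# then uncompress it the same way you would the downloaded .bz2.
echo '<mediawiki>stand-in dump</mediawiki>' > dump.xml
bzip2 dump.xml           # produces dump.xml.bz2 and removes dump.xml
bzip2 -d dump.xml.bz2    # restores dump.xml, ready for crunching
cat dump.xml
```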

First I wanted to get the entire list of users out.

grep '<username>' enwikiversity-20070903-pages-meta-history.xml > users.txt
cat users.txt | uniq > users-1.txt
cat users-1.txt | sort | uniq > users-uniq.txt
cat users-uniq.txt | sed -e 's/^\s*<username>//g; s/<\/username>//g;' > userlist.txt

(The extra uniq pass before the sort is there because the file is otherwise slow and big in the sort.)
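To see what that pipeline actually does, here it is run against a three-line stand-in for the dump (the usernames are made up, and the sed pattern assumes GNU sed's `\s`):

```shell
# A tiny stand-in for the real dump file, just to exercise the pipeline.
printf '  <username>Alice</username>\n  <username>Bob</username>\n  <username>Alice</username>\n' > sample.xml

grep '<username>' sample.xml > users.txt
cat users.txt | uniq > users-1.txt
cat users-1.txt | sort | uniq > users-uniq.txt
cat users-uniq.txt | sed -e 's/^\s*<username>//g; s/<\/username>//g;' > userlist.txt
cat userlist.txt    # prints Alice and Bob, one per line
```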

Next I wanted information on each revision in the xml dump: username, title, namespace.

cat enwikiversity-20070903-pages-meta-history.xml | ./versity-xml.pl > out

Here's the script:


#!/usr/bin/perl
# write out user, namespace, title, for each revision
#
# Example titles:
#   Wikiversity:Help desk
#   Wikiversity talk:Help desk
#   User_talk:

binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");

use encoding 'utf8';

while (<STDIN>) {
    $line = $_;
    if ($line =~ /<page>/) {
        $user = ""; $ns = ""; $title = "";
    }
    elsif ($line =~ /<title>/) {
        if ($line =~ /<title>([^:]+):(.+)<\/title>/) {
            $ns = $1; $title = $2;
        }
        elsif ($line =~ /<title>(.*)<\/title>/) {
            $ns = "main"; $title = $1;
        }
    }
    elsif ($line =~ /<username>/) {
        if ($line =~ /<username>(.*)<\/username>/) {
            $user = $1;
        }
    }
    elsif ($line =~ /<ip>/) {
        $user = "";
    }
    elsif ($line =~ /<\/revision>/) {
        $out = "$user\t$ns\t$title\n";
        if ($out !~ /^\t\t$/) {
            print $out;
        }
    }
}

Hey, didn't I say it wouldn't be pretty?

What you need to know to make the script make sense is the structure of the XML dump. If you look at one, it has this sort of stuff in it:

<page>
  <title>User:Cormaggio</title>
  <id>2</id>
  <revision>
    <id>4</id>
    <timestamp>2006-08-15T08:19:38Z</timestamp>
    <contributor>
      <username>Cormaggio</username>
      <id>8</id>
    </contributor>
    <comment>greetings :-)</comment>
    <text xml:space="preserve">Hello all, it's great to finally have Wikiversity up and running! I'm so looking forward to working on this project - but am pretty busy over the next month with my dissertation (about Wikiversity ;-)). I'll be happy to answer any questions about the project - I've been pretty active in getting this project started. Looking forward to working with you! Cormaggio 08:19, 15 August 2006 (UTC)</text>
  </revision>
  ...
</page>

IP address contributors are tagged with <ip> instead of <username>.

You can see that the namespace is separated from the title of the page by a colon ':', and if there is no colon then the article is in the main namespace.
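That split rule can be sketched as a little shell function (`split_title` is a made-up name for this demo, not part of any script above):

```shell
# Split a page title into namespace and title the way the Perl script does:
# text before the first colon is the namespace; no colon means "main".
split_title() {
  case "$1" in
    *:*) printf '%s\t%s\n' "${1%%:*}" "${1#*:}" ;;
    *)   printf 'main\t%s\n' "$1" ;;
  esac
}

split_title 'Wikiversity:Help desk'       # namespace "Wikiversity", title "Help desk"
split_title 'Introduction to Programming' # no colon, so namespace "main"
```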

I only want one entry per user per page for this next bit...

cat out | uniq > out-1.txt
cat out-1.txt | sort | uniq > out-uniq.txt

(Again, the extra uniq pass is there because the file is otherwise slow and big in the sort.)

Now collect all the users that edited either the Help desk or User talk pages:

grep Wikiversity out-uniq.txt | grep Help | grep desk > these.txt
grep 'Wikiversity talk' out-uniq.txt | grep Help | grep desk >> these.txt
grep User_talk out-uniq.txt >> these.txt
more these.txt | awk -F'\t' '{ print $1 }' | sort | uniq > possible

Get some numbers...

./check-these.sh

Here's the script for that:

#!/bin/bash

i='some user name here'
echo "$i" >> check2
echo "non-user_talk, helpdesk edits" >> check2
grep "$i" out | grep -v 'Help desk' | grep -v 'User talk' | grep -v User | wc -l >> check2
echo "helpdesk edits" >> check2
grep "$i" out | grep 'Help desk' | wc -l >> check2
echo "user talk edits" >> check2
grep "$i" out | grep 'User talk' | wc -l >> check2

i='some other user name here'
echo "$i" >> check2
echo "non-user_talk, helpdesk edits" >> check2
grep "$i" out | grep -v 'Help desk' | grep -v 'User talk' | grep -v User | wc -l >> check2
echo "helpdesk edits" >> check2
grep "$i" out | grep 'Help desk' | wc -l >> check2
echo "user talk edits" >> check2
grep "$i" out | grep 'User talk' | wc -l >> check2

and so on, where the names came out of the "possible" list.

The output such as it is looks like

first_user_name
non-user_talk, helpdesk edits
0
helpdesk edits
1
user talk edits
0
second_user_name
non-user_talk, helpdesk edits
0
helpdesk edits
2
user talk edits
0

and that's it. There weren't so many folks who had edited the help desk or user talk pages in the past year, and we didn't really care about looking at earlier data, so once we had this file we could inspect it by hand to see any useful trends.
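For what it's worth, the per-user greps could be folded into a single awk pass over the revision list. Here's a sketch against made-up data (the real input would be the out file from earlier; columns in the result are user, other edits, Help desk edits, User talk edits):

```shell
# Made-up revision list in the same user \t namespace \t title format as "out".
printf 'Alice\tWikiversity\tHelp desk\nAlice\tmain\tPhysics\nBob\tUser talk\tAlice\n' > out-sample

# Tally each user's edits into three buckets in one pass.
awk -F'\t' '{
    if ($2 == "Wikiversity" && $3 == "Help desk") help[$1]++
    else if ($2 == "User talk")                   talk[$1]++
    else                                          other[$1]++
    seen[$1] = 1
}
END {
    for (u in seen)
        printf "%s\t%d\t%d\t%d\n", u, other[u], help[u], talk[u]
}' out-sample | sort
```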