Time to combine two of my hobbies, World of Warcraft and programming.
A while ago I worked on an armory scraper / crawler that’s able to parse the WoW Armory XML.
Why, you ask? Well, I just really like numbers and statistics. And I’m sure there’s lots of interesting stuff to find if you can search the armory with your own queries and create nice overviews and graphs (everyone loves graphs!).
I’m at a point now where I give the crawler an initial character to start with and it just crawls from there, picking up that character’s guild and parsing all the guild members.
Whenever the crawler picks up a new character it will also look at that character’s arena teams and add their members to the database.
While it’s parsing them it will usually find new guilds, and there we go: another load of characters added to the database.
So, broken down into a flowchart, it would look something like this.
Now you’ll notice there’s one big flaw. In order for me to find a character it needs to be either in a guild or in an arena team with a person who’s in a guild.
Also, that guild can’t be a one-man guild.
The only way I can think of to fix this is to add those people manually or to let them add themselves to the database.
But because I’m guessing only a really small number of people aren’t linked on the armory, I’m not too worried about it.
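To make the crawl and its blind spot a bit more concrete, here is a minimal sketch in Python. It is not the actual crawler code: the three lookup functions and the tiny in-memory "armory" are made up for illustration, and a real version would fetch and parse Armory pages instead.

```python
from collections import deque


def crawl(seed, get_guild, get_guild_members, get_arena_teammates):
    """Breadth-first crawl starting from one character name.

    The three callables are hypothetical stand-ins for the real
    Armory lookups; each one works with plain names.
    """
    seen_characters = {seed}
    seen_guilds = set()
    queue = deque([seed])

    while queue:
        character = queue.popleft()

        # Collect everyone reachable from this character.
        found = list(get_arena_teammates(character))
        guild = get_guild(character)
        if guild is not None and guild not in seen_guilds:
            seen_guilds.add(guild)
            found.extend(get_guild_members(guild))

        # Queue only characters that have not been stored yet.
        for name in found:
            if name not in seen_characters:
                seen_characters.add(name)
                queue.append(name)

    return seen_characters


# Tiny made-up "armory" so the sketch runs without network access.
GUILDS = {"Alpha": ["Anna", "Bob"], "Beta": ["Cara", "Dave"], "Solo": ["Eve"]}
ARENA = {"Bob": ["Cara"], "Cara": ["Bob"]}


def guild_of(name):
    """Return the guild a character belongs to, or None."""
    return next((g for g, members in GUILDS.items() if name in members), None)


result = crawl("Anna", guild_of, GUILDS.__getitem__, lambda n: ARENA.get(n, []))
print(sorted(result))
```

Starting from Anna, the crawl reaches Bob through the guild, Cara through Bob’s arena team, and Dave through Cara’s guild. Eve sits alone in a one-man guild with no arena team, so she is never found.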
The last time I ran the crawler I got about 10,000 characters, which is quite a nice set to work with.
The crawler is written in Python because of the excellent libraries it comes with and the pace at which you can code.
The only problem I keep running into, however, is character encoding. Because people use weird characters in their names, it sometimes breaks my code, and I haven’t really figured out how to properly fix this. Maybe it’s the MySQL backend? For now I’ve just found something that works well enough to let me keep adding more scraping capabilities.
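One approach that tends to avoid this kind of breakage is to keep names as Unicode text everywhere inside the program and only convert to bytes at the edges. The sketch below assumes Python 3 and uses a made-up one-element XML snippet, not real Armory output. If MySQL does turn out to be the culprit, the table and connection character sets would be the first thing to check (MySQL’s `utf8mb4` covers all of Unicode), but that is a guess about this particular setup.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical response body: bytes, exactly as they would come off the wire.
raw = '<?xml version="1.0" encoding="UTF-8"?><character name="Ñørd"/>'.encode("utf-8")

# Parsing the raw bytes lets the parser honour the encoding declaration itself,
# so the attribute comes back as a proper text string.
element = ET.fromstring(raw)
name = element.get("name")

# Normalise so that visually identical names compare equal.
name = unicodedata.normalize("NFC", name)

# Encode only at the boundary, e.g. when writing to a file or a socket.
stored = name.encode("utf-8")
assert stored.decode("utf-8") == name

print(name, len(name), len(stored))
```

The name has four characters but takes six bytes in UTF-8, which is exactly the kind of mismatch that bites when text and bytes get mixed up somewhere along the way.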
I’ve set the crawl rate at about 1 page every 2 seconds so I don’t spam the armory too much, and it seems to be fine like that. Once you go below a second between requests it sometimes chokes and cuts you off. The bandwidth consumption isn’t that bad though, since the crawler only asks the server for the XML and doesn’t load all the images and other doodads.
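A simple way to enforce a crawl rate like that is a small throttle that sleeps just long enough before each request. This is only a sketch: the `Throttle` class and its parameters are my own naming, and the demo uses a very short interval and placeholder page names so it finishes quickly.

```python
import time


class Throttle:
    """Make sure at least `interval` seconds pass between calls to wait()."""

    def __init__(self, interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(interval=0.05)  # the crawler itself would use 2.0
for page in ["page-1", "page-2", "page-3"]:
    throttle.wait()
    # The actual request for `page` would go here.
    print("fetching", page)
```

Measuring from the previous request rather than always sleeping a fixed 2 seconds means the time spent downloading and parsing counts towards the interval.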
So far I’m able to scrape most of the data I want from the armory. I can get all of the stats and the gear the character is wearing, as well as the glyphs and talent points. The only challenge that’s left seems to be the achievements, but I’m sure I’ll eventually work that out as well.
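Pulling those values out of the XML mostly comes down to walking the element tree. The sketch below uses a made-up sample: the element and attribute names are assumptions for illustration, not the real Armory schema, and `parse_character` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up sample: element and attribute names are assumptions,
# not the real Armory schema.
SAMPLE = """\
<character name="Anna" level="80">
  <stats strength="120" stamina="340"/>
  <items>
    <item slot="head" id="1001"/>
    <item slot="chest" id="1002"/>
  </items>
  <glyphs>
    <glyph name="Glyph of Example"/>
  </glyphs>
</character>
"""


def parse_character(xml_text):
    """Turn one character document into a plain dictionary."""
    root = ET.fromstring(xml_text)
    stats_el = root.find("stats")
    stats = {}
    if stats_el is not None:
        # Every attribute on <stats> is treated as one integer stat.
        stats = {key: int(value) for key, value in stats_el.attrib.items()}
    return {
        "name": root.get("name"),
        "level": int(root.get("level")),
        "stats": stats,
        "gear": {item.get("slot"): int(item.get("id")) for item in root.iter("item")},
        "glyphs": [glyph.get("name") for glyph in root.iter("glyph")],
    }


character = parse_character(SAMPLE)
print(character["name"], character["gear"])
```

A flat dictionary like this is easy to map onto database columns afterwards, whatever the real element names turn out to be.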
Once I’ve got a nice set of data I’ll probably upload it here so you can all take a look.

