Wednesday, December 31, 2014

How to mine a ladder (cont.).

With all the data safely stashed next step is to prepare it for a spreadsheet:

process.js:

function readInput()
{
    var s = "";
    while (!WScript.StdIn.AtEndOfStream) s += WScript.StdIn.ReadAll();
    return s;
}

function pad0(s)
{
    while (s.length<2) s='0'+s;
    return s;
}

function makeISOdate(s)
{
    var ss = /(\d+)\.(\d+)\.(\d+)\s+(\d+):(\d+):?(\d*)/.exec(s);
    return ''+ss[3]+'-'+pad0(ss[2])+'-'+pad0(ss[1])+"T"+pad0(ss[4])+':'+pad0(ss[5])+':'+pad0(ss[6]);
}

function getCommentsNumber(s)
{
    var ss = /^\s*(\d+)/.exec(s);
    return ss ? ss[1] : '0';
}

function trimVersion(s)
{
    var ss = /^(.*)\s(.*?)/.exec(s);
    return ss ? ss[1] : s;
}

var headers = [
"status",
"year",
"short_variant",
"id",
"character",
"ord",
"ord_by_variant",
"player",
"race",
"level",
"exp",
"turn",
"maxdepth",
"variant",
"comments",
"time"
];

WScript.Echo(headers.join("\t"));

var lines = readInput().split("\n");
for (i=0; i<lines.length; ++i)
{
   var s = lines[i];
   if (s=="") continue;
  
   var ss = new Array();

   if (!s.indexOf("<tr style='background-color: rgb(120,60,41)'>")) ss[0] = "dead";
   else if (!s.indexOf("<tr style='background-color: rgb(61,91,61)'>")) ss[0] = "winner";
   else ss[0] = "alive";

   var re = /<td.*?>(.*?)<\/td>/g
   var result
  
   var td = new Array();
   for (k=0; ;++k)
   {
      result = re.exec(s);
      if (!result) break;
      td[k] = result[1];
      if (td[k] == "&nbsp;") td[k] = "";
   }

   result = /<a href='ladder-show\.php\?id=(.*)'><b>(.*)<\/b><\/a>/.exec(td[2])
   ss[3] = result[1];
   ss[4] = result[2];
   ss[5] = td[0];
   ss[6] = td[1];
   result = /<a href="(.*?)">/.exec(td[11])
   ss[7] = result ? result[1] : "";
   ss[8]  = td[3];
   ss[9]  = td[4].replace(/[,]/g,"");
   ss[10] = td[5].replace(/[,]/g,"");
   ss[11] = td[6].replace(/[,]/g,"");
   ss[12] = td[7].replace(/[,]/g,"");
   ss[13] = td[8];
   ss[14] = getCommentsNumber(td[9]);
   ss[15] = makeISOdate(td[10]);
 
   ss[1] = ss[15].substring(0,4);
   ss[2] = trimVersion(td[8]);
  
   for (k=0; k<ss.length; ++k) ss[k] = ss[k].replace(/[\t]/g," ");
   WScript.Echo(ss.join("\t"));
}
Basically it transforms HTML table data into CSV items.
I have come to this script trying to minimize hassle on the spreadsheet. F.e. colunms year and short_variant could be computed by the spreadsheet program, but why not here while we are at it.

In the command prompt:  cscript /nologo process.js <dump >data.txt
Now there is a data.txt, loadable by Excel or OpenOffice Calc.
Lets load it in OpenOffice Calc:

Insert/Sheet from file ...
select data.txt
choose Separated by Tab

and there is a new sheet with all the data nicely shaped.

NOTE: Column time contains text instead of time, each value is prepended with '.  When you delete this ', value becomes an Excel time. Edit it that way one at a time?  Bummer. The solution is to do search and replace on the column and replace .* with &, that is replace text with itself. I recorded that once as a macro and then put on a button on a toolbar. 
Now all is ready for a little analysis.

select first 4 columns A-D
Data/PivotTable/Create...
checked Current selection
drag status onto Page fields, short_variant  Row fields, year Column fields, id  Data fields
select id in Data fileds press Options and choose Count
press More and set Results To to -new sheet-

press OK

And here it is:



Neat.  I see that Angband vanilla has been always actual. Halls of Mist and Neoband are both quite new.  Entroband was on its prime in 2006, etc. If top combo presently with status -all- changed to winner, one could see how many folks have managed to finish the game (and to submit a dump)  I do not want to start a game which is unwinnable, do I?

I started to compare figures in the table and values in Variant selector on the angband ladder site. Figures went identical until Sil and TomeNET. My table did not include them at all. It turned out this particular two variants are not selected by Variant -all- setting. So to add them to the table, grab.js should contain:
getLadderPages("http://angband.oook.cz/ladder-browse.php?");
getLadderPages("http://angband.oook.cz/ladder-browse.php?v=Sil&");
getLadderPages("http://angband.oook.cz/ladder-browse.php?v=TomeNET&");
process.js should also be adjusted for varying format.
You can get final scripts and the spreadsheet document here.

No comments:

Post a Comment