Wednesday, December 31, 2014

How to mine a ladder (cont.).

With all the data safely stashed next step is to prepare it for a spreadsheet:

process.js:

function readInput()
{
    var s = "";
    while (!WScript.StdIn.AtEndOfStream) s += WScript.StdIn.ReadAll();
    return s;
}

function pad0(s)
{
    while (s.length<2) s='0'+s;
    return s;
}

function makeISOdate(s)
{
    var ss = /(\d+)\.(\d+)\.(\d+)\s+(\d+):(\d+):?(\d*)/.exec(s);
    return ''+ss[3]+'-'+pad0(ss[2])+'-'+pad0(ss[1])+"T"+pad0(ss[4])+':'+pad0(ss[5])+':'+pad0(ss[6]);
}

function getCommentsNumber(s)
{
    var ss = /^\s*(\d+)/.exec(s);
    return ss ? ss[1] : '0';
}

function trimVersion(s)
{
    var ss = /^(.*)\s(.*?)/.exec(s);
    return ss ? ss[1] : s;
}

var headers = [
"status",
"year",
"short_variant",
"id",
"character",
"ord",
"ord_by_variant",
"player",
"race",
"level",
"exp",
"turn",
"maxdepth",
"variant",
"comments",
"time"
];

WScript.Echo(headers.join("\t"));

var lines = readInput().split("\n");
for (i=0; i<lines.length; ++i)
{
   var s = lines[i];
   if (s=="") continue;
  
   var ss = new Array();

   if (!s.indexOf("<tr style='background-color: rgb(120,60,41)'>")) ss[0] = "dead";
   else if (!s.indexOf("<tr style='background-color: rgb(61,91,61)'>")) ss[0] = "winner";
   else ss[0] = "alive";

   var re = /<td.*?>(.*?)<\/td>/g
   var result
  
   var td = new Array();
   for (k=0; ;++k)
   {
      result = re.exec(s);
      if (!result) break;
      td[k] = result[1];
      if (td[k] == "&nbsp;") td[k] = "";
   }

   result = /<a href='ladder-show\.php\?id=(.*)'><b>(.*)<\/b><\/a>/.exec(td[2])
   ss[3] = result[1];
   ss[4] = result[2];
   ss[5] = td[0];
   ss[6] = td[1];
   result = /<a href="(.*?)">/.exec(td[11])
   ss[7] = result ? result[1] : "";
   ss[8]  = td[3];
   ss[9]  = td[4].replace(/[,]/g,"");
   ss[10] = td[5].replace(/[,]/g,"");
   ss[11] = td[6].replace(/[,]/g,"");
   ss[12] = td[7].replace(/[,]/g,"");
   ss[13] = td[8];
   ss[14] = getCommentsNumber(td[9]);
   ss[15] = makeISOdate(td[10]);
 
   ss[1] = ss[15].substring(0,4);
   ss[2] = trimVersion(td[8]);
  
   for (k=0; k<ss.length; ++k) ss[k] = ss[k].replace(/[\t]/g," ");
   WScript.Echo(ss.join("\t"));
}
Basically it transforms HTML table data into CSV items.
I have come to this script trying to minimize hassle on the spreadsheet. F.e. colunms year and short_variant could be computed by the spreadsheet program, but why not here while we are at it.

In the command prompt:  cscript /nologo process.js <dump >data.txt
Now there is a data.txt, loadable by Excel or OpenOffice Calc.
Lets load it in OpenOffice Calc:

Insert/Sheet from file ...
select data.txt
choose Separated by Tab

and there is a new sheet with all the data nicely shaped.

NOTE: Column time contains text instead of time, each value is prepended with '.  When you delete this ', value becomes an Excel time. Edit it that way one at a time?  Bummer. The solution is to do search and replace on the column and replace .* with &, that is replace text with itself. I recorded that once as a macro and then put on a button on a toolbar. 
Now all is ready for a little analysis.

select first 4 columns A-D
Data/PivotTable/Create...
checked Current selection
drag status onto Page fields, short_variant  Row fields, year Column fields, id  Data fields
select id in Data fileds press Options and choose Count
press More and set Results To to -new sheet-

press OK

And here it is:



Neat.  I see that Angband vanilla has been always actual. Halls of Mist and Neoband are both quite new.  Entroband was on its prime in 2006, etc. If top combo presently with status -all- changed to winner, one could see how many folks have managed to finish the game (and to submit a dump)  I do not want to start a game which is unwinnable, do I?

I started to compare figures in the table and values in Variant selector on the angband ladder site. Figures went identical until Sil and TomeNET. My table did not include them at all. It turned out this particular two variants are not selected by Variant -all- setting. So to add them to the table, grab.js should contain:
getLadderPages("http://angband.oook.cz/ladder-browse.php?");
getLadderPages("http://angband.oook.cz/ladder-browse.php?v=Sil&");
getLadderPages("http://angband.oook.cz/ladder-browse.php?v=TomeNET&");
process.js should also be adjusted for varying format.
You can get final scripts and the spreadsheet document here.

Tuesday, December 30, 2014

How to mine a ladder.

Once upon a time I used to play Moria, then, a decade later, its successor Angband. I was playing in earnest, and though I never finished Moria, being killed sooner or later invariably,  I've beaten Angband with a warrior, allegedly more difficult class than casters or rangers. Well, they have special name for such folks, nerds^H^H roguelike fans, and a special place to keep their feats, the Angband ladder. My victory, is not it glorious! More than ten years have passed and it is still there, thanks to internets and to Páv Lučištník.

Yet another decade later I decided to revive my old thrills and sorrows and to try to kill the lurking dungeon boss once again, winter holidays and such. The problem was which game to choose, the latest Angband release or one of the variants. There are lots of them. I thought I should look what other peoples choose now, what is cool, what is dead and to do that I shall use Angband ladder data. It has assembled 12839 entries since 2002 and continues to grow, 7 entries added last weekend (turned out to be even more, on that later)  And of course, mere web interface of the ladder wouldn't do. Business means sorting, grouping, dicing, slicing and, mind you, diagramming until one has all the information for this problem of choice. So the little project begun...

First thing was to collect the site data, and it better be with just one try. Nobody likes their sites being pestered by someone's test runs. The simpler the grabber, the less moving parts are in there, the better. It might write the entire HTML pages to output but then I decided to get only table rows with data. It's nicer looking that way, one can tell by line count how many records are there, and it streamlines the further processing. Also I decided I won't need the character dumps themselves. So, the grabber (4th iteration, so much for "one try" approach):
  
grab.js:
function getText(strURL) // MSDN example
{
    var strResult;
    try
    {
        var WinHttpReq = new ActiveXObject("WinHttp.WinHttpRequest.5.1");
        var temp = WinHttpReq.Open("GET", strURL, false);
        WinHttpReq.Send();
        strResult = WinHttpReq.ResponseText;
    }
    catch (objError)
    {
        strResult = objError + "\n"
        strResult += "WinHTTP returned error: " +
            (objError.number & 0xFFFF).toString() + "\n\n";
        strResult += objError.description;
    }
    return strResult;
}

function hasData(s)
{
  return s.indexOf("href='ladder-show.php?") != -1;
}

function getLadderPages(url)
{
    var o=0;

    for (;;)
    {
        var s = getText(url+"o="+o);
        if (hasData(s))  // o still in range ?
        {
           var re = /<tr.*?<\/tr>/g;
           var result = re.exec(s);
           while (result != null)
           {
              var ss = result[0];
              if (hasData(ss))  WScript.Echo(ss);  // print data lines, not headers
              result = re.exec(s);
           }
        }
        else
            break;
        o = o+1;
    }
}

getLadderPages("http://angband.oook.cz/ladder-browse.php?");
It's a Windows machine, you see.
Now in the command prompt:  cscript /nologo grab.js >dump
After a while there is a 5,5M of a dump, locally stored. I can parse it forward, backward and sideways. That's good.