Tuesday, July 13, 2010

Google Analytics, Continued

After finishing the first round of my Google Analytics project, I suddenly had an idea: why not compare my own weekly aggregation results with the ones provided by Google, to check whether my calculations were correct? So I opened the Google Analytics Query Explorer, entered all the parameters, and got the data back right from the Query Explorer. The first thing I found was that the data did not match, so I started to investigate. I ran a query with ga:pagePath in the dimensions, which should give me the visit numbers for every page over a week, copied the data from the Query Explorer into XLS, and applied the SUM function. This time I got the same result as my own calculation, so I started to suspect something fishy on Google's side. Then I did another test and found something even more interesting: the visit number for a particular path in the same week is displayed differently for different time ranges. In other words, the figures for the same week should be consistent regardless of the overall time range. That seemed worth opening a ticket with Google, so I did, and I am happy to report that they accepted the ticket and are investigating the issue. The ticket can be found at the following link:
http://code.google.com/p/gdata-issues/issues/detail?id=2100

Tuesday, June 1, 2010

Google Analytics Project (Part 2) - API and Multi-threaded Processing

The Google Analytics service provides an API that lets developers programmatically download web statistics data. The API technical documentation is here: http://code.google.com/apis/analytics/docs/gdata/1.0/gdataProtocol.html
Basically, you need a Google account to access the Analytics features. You set up the website you want to analyze and put the generated JavaScript tracking code into your pages; a single Analytics account can contain multiple websites. Once everything is set up correctly, Google Analytics starts collecting data for your site. When you access the data through the API, you specify the table ID that was generated when the website was set up and pass it along with the login account and password in order to connect to the service. Let's look at the following sample from Google:

The code above opens two connections to Google.
Line 88 creates a new AnalyticsService that will be used to fetch the data feed later. Line 90 is where it actually connects to the Google service with your login ID and password, and later, at line 105, another connection is made to the service to get the data feed. Having noticed this behavior, I found that we can log in once and then reuse the same AnalyticsService object to call getFeed multiple times. Line 93 creates a new DataQuery object with the feed URL, and lines 94–100 set all the parameters; the TABLE_ID and the metrics are mandatory. The returned data feed will contain all the dimensions and metrics you specified, in this case entries of “ga:week,ga:pagePath,ga:date,ga:visits,ga:pageviews,ga:timeOnPage”. Line 105 is the point where the data feed is downloaded from Google and stored in your variable.
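To make that description concrete, here is a minimal sketch of the same flow, assuming the GData Java client classes named above; the account, password, table ID and dates are placeholders, not real values.

import java.net.URL;
import com.google.gdata.client.analytics.AnalyticsService;
import com.google.gdata.client.analytics.DataQuery;
import com.google.gdata.data.analytics.DataEntry;
import com.google.gdata.data.analytics.DataFeed;

public class AnalyticsSample {
    // Placeholders: substitute your own account, password and table id (e.g. "ga:1234567").
    private static final String USERNAME = "your.account@gmail.com";
    private static final String PASSWORD = "your-password";
    private static final String TABLE_ID = "ga:1234567";

    public static void main(String[] args) throws Exception {
        // First connection: authenticate the service with your login ID and password.
        AnalyticsService service = new AnalyticsService("myCompany-myApp-1.0");
        service.setUserCredentials(USERNAME, PASSWORD);

        // Build the query: the table id and the metrics are mandatory, the rest are optional.
        DataQuery query = new DataQuery(new URL("https://www.google.com/analytics/feeds/data"));
        query.setIds(TABLE_ID);
        query.setStartDate("2010-05-01");
        query.setEndDate("2010-05-31");
        query.setDimensions("ga:week,ga:pagePath,ga:date");
        query.setMetrics("ga:visits,ga:pageviews,ga:timeOnPage");

        // Second connection: download the data feed and keep it in a local variable.
        DataFeed feed = service.getFeed(query.getUrl(), DataFeed.class);
        for (DataEntry entry : feed.getEntries()) {
            System.out.println(entry.stringValueOf("ga:pagePath") + " : "
                    + entry.stringValueOf("ga:visits"));
        }
    }
}

Once the service object is authenticated it can be kept around and getFeed called on it repeatedly, which is exactly the reuse noted above.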

There are some quota restrictions from Google. The maximum result count is 1000 by default, meaning 1000 records are downloaded at once, and you can specify no more than 10000; the more you specify, the slower the response. The service also has segment and filter parameters to narrow the data down. Keep in mind, however, that you may have more than 10000 records per day depending on your dimension and metric parameters, so the question is how to download the rest beyond the first 10000. Fortunately, the service has a start-index parameter, so by combining the start index with the maximum result number you can download all the records. Another question follows: how do I download all of them at once if I need them nicely presented in a table or viewed in an XLS sheet? To answer that, you issue multiple requests for the data feed and store the data somewhere as each download finishes; once all the downloads are done, you have everything. You also need to know the total number of records in order to know when to stop, and fortunately the DataFeed class provides getTotalResults() for that. As I mentioned earlier, downloading 10000 records takes roughly 30 seconds on the initial run. Google does cache these records, so re-running the same query with a different record count, say 2000, is faster, but that is not true for a different start index, and the 30 seconds varies with time and place. Anyway, what I am going to show is how to use multi-threading to download all the data at once and speed up the process.
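As a rough sketch of that pagination loop, reusing the service and query objects from the sample sketch above, it could look like this:

// Pagination sketch: walk the feed in pages of maxResults using start-index.
int maxResults = 1000;
int startIndex = 1;
int totalResults = Integer.MAX_VALUE;     // refined after the first response

query.setMaxResults(maxResults);
while (startIndex <= totalResults) {
    query.setStartIndex(startIndex);
    DataFeed feed = service.getFeed(query.getUrl(), DataFeed.class);
    totalResults = feed.getTotalResults(); // total record count reported by the service
    // ...store feed.getEntries() somewhere before moving on...
    startIndex += maxResults;
}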

Using multi-threading for this application raises several design questions. The most important is assigning the right start index for each thread in the pool to download and store. Another is how many threads should run, which depends on the max-results value mentioned above, since that value can be specified. Second, Google's DataQuery and DataFeed need to be treated as local to each thread, so that each thread uses its own query to download and its own DataFeed to fill in the table. To achieve maximum concurrency, we shouldn't synchronize the whole transaction from download to storage; the synchronized block should be kept as small as possible. The start index is essential: access to it must be mutually exclusive, and the value each thread takes must be treated as thread-local. Therefore we synchronize only around taking the next start index, and inside that block we store the index in either a ThreadLocal variable or a ConcurrentMap. Here is the snapshot:


The synchronized block spans lines 161 to 174. Mutex is an integer that is increased by the max-results number. Line 172 stores the mutex value as the start index; behind setStartIndex, the value is actually put into a ThreadLocal variable so that each thread can retrieve its own start index. Line 173 increases the mutex for the next start index value. Lines 163 to 171 check whether the start index (mutex) is already greater than the total number of records; if it is, the thread performs some cleanup, exits the while loop, and ends. You may notice that I clean up the maps for feedLocal and threadLocalQuery. As mentioned before, DataFeed and DataQuery need to be treated as thread-local. I originally used ThreadLocal variables for them, but the memory did not seem to be managed as effectively as with a ConcurrentMap. I could also have used a separate GlicAnalyticQuery instance per thread, which wraps the DataQuery and DataFeed, and then I wouldn't need to put them into a map at all. The reason for reusing the same GlicAnalyticQuery object is that it might be a heavy object, creating many other objects such as result sets, services or even connections, and you don't want multiple copies of it; it is similar to the flyweight pattern. Holding the per-thread values in a map serves the same purpose, in that each thread gets its own value (ThreadLocal internally uses a map too). Continuing to line 177, this is where the download and store actions are performed concurrently by the different threads. Internally, a service executes the query for the download, and when the download finishes the data is populated into a two-dimensional array; the row index in the array corresponds to the thread's start index, and the number of rows stored equals the max-results number.
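To make the idea concrete, here is a rough sketch of that scheme. The class and field names are my own and only the GData calls come from the library; also, instead of the ConcurrentMap approach described above, this sketch simply builds a fresh DataQuery for each batch, which gives each thread its own query in a simpler way.

import java.net.URL;
import com.google.gdata.client.analytics.AnalyticsService;
import com.google.gdata.client.analytics.DataQuery;
import com.google.gdata.data.analytics.DataEntry;
import com.google.gdata.data.analytics.DataFeed;

class FeedDownloader implements Runnable {
    private static final Object LOCK = new Object();
    private static int nextStartIndex = 1;      // plays the role of the "mutex" counter

    private final AnalyticsService service;     // shared, already authenticated
    private final String[][] table;             // pre-sized to getTotalResults() rows
    private final int maxResults;
    private final int totalResults;

    FeedDownloader(AnalyticsService service, String[][] table, int maxResults, int totalResults) {
        this.service = service;
        this.table = table;
        this.maxResults = maxResults;
        this.totalResults = totalResults;
    }

    public void run() {
        while (true) {
            int startIndex;
            // Keep the synchronized block as small as possible: only claim the next index here.
            synchronized (LOCK) {
                if (nextStartIndex > totalResults) {
                    return;                      // nothing left to download, the thread ends
                }
                startIndex = nextStartIndex;
                nextStartIndex += maxResults;
            }
            try {
                // Each batch uses its own DataQuery, so threads never share query state.
                DataQuery query = new DataQuery(new URL("https://www.google.com/analytics/feeds/data"));
                query.setIds("ga:1234567");                          // placeholder table id
                query.setStartDate("2010-05-01");
                query.setEndDate("2010-05-31");
                query.setDimensions("ga:week,ga:pagePath,ga:date");
                query.setMetrics("ga:visits,ga:pageviews,ga:timeOnPage");
                query.setStartIndex(startIndex);
                query.setMaxResults(maxResults);

                DataFeed feed = service.getFeed(query.getUrl(), DataFeed.class);
                int row = startIndex - 1;                            // feed indexes are 1-based
                for (DataEntry entry : feed.getEntries()) {
                    table[row][0] = entry.stringValueOf("ga:pagePath");
                    table[row][1] = entry.stringValueOf("ga:visits");
                    table[row][2] = entry.stringValueOf("ga:pageviews");
                    row++;
                }
            } catch (Exception e) {
                e.printStackTrace();             // a real version would retry or report failures
            }
        }
    }
}

Because each batch owns a disjoint block of rows, the two-dimensional array can be written without any further locking.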
Below is the screenshot for the output:


The main thread makes a dummy call to the Google service just to get the total result count, which is why its max-results is only 1. In this case totalResults equals 2330, so a two-dimensional array with 2330 rows is created for storage. The main thread finishes in 3.483 seconds. Next you can see three threads running concurrently, threads 0, 2 and 4; from the URL output, thread 0's start-index is 1, thread 2's is 1001 and thread 4's is 2001, and all of them use max-results of 1000. Thread 4 finishes in 2.998 seconds, thread 0 in 3.451 seconds and thread 2 in 3.592 seconds. The last line is the total time from the main thread until the last thread finishes: 7.106 seconds, which is approximately the last thread's 3.592 seconds plus the main thread's 3.483 seconds, i.e. 7.075. Now consider setting the maximum result number to 200. In that case all 8 threads in the pool start concurrently, and whenever one finishes it takes the next start index. Here is the output snapshot; you can see threads 6, 2 and 1 asking for the next start index after they finish.

 
I believe the Google service also has some restrictions on concurrent downloads. Here is the link
Basically, you can't issue 10 requests in a second, which is why I only give the pool 8 threads, and you can have at most 4 pending requests at any given time (i.e. you must wait until your 1st request completes before making a 5th request).
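Putting the pieces together under those limits, a driver along the lines of the output described above might look like the following; again this is just a sketch, and it reuses the FeedDownloader class from the previous snippet.

import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import com.google.gdata.client.analytics.AnalyticsService;
import com.google.gdata.client.analytics.DataQuery;
import com.google.gdata.data.analytics.DataFeed;

public class DownloadAll {
    public static void main(String[] args) throws Exception {
        AnalyticsService service = new AnalyticsService("myCompany-myApp-1.0");
        service.setUserCredentials("your.account@gmail.com", "your-password");

        // Main-thread dummy call: ask for a single row just to learn the total record count.
        DataQuery probe = new DataQuery(new URL("https://www.google.com/analytics/feeds/data"));
        probe.setIds("ga:1234567");                      // placeholder table id
        probe.setStartDate("2010-05-01");
        probe.setEndDate("2010-05-31");
        probe.setDimensions("ga:week,ga:pagePath,ga:date");
        probe.setMetrics("ga:visits,ga:pageviews,ga:timeOnPage");
        probe.setMaxResults(1);
        DataFeed firstFeed = service.getFeed(probe.getUrl(), DataFeed.class);
        int totalResults = firstFeed.getTotalResults();

        // Pre-size the storage, then let an 8-thread pool (under the concurrency quota) drain the work.
        int maxResults = 1000;
        String[][] table = new String[totalResults][3];
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            pool.execute(new FeedDownloader(service, table, maxResults, totalResults));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        System.out.println("Downloaded " + totalResults + " records.");
    }
}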

In the next post, I will talk about using DWR to fill the data into the table that I have created in this post.

Thursday, May 27, 2010

Google Analytics Project (Part 1) UI - Fixed-Header Table with Dynamic Data

Recently, I have been working on a project that involves displaying a custom analytics web-trend report for our client. The data is retrieved using the Google Analytics service.
Google Analytics provides an API to retrieve statistics for different dimensions, such as month, week or date, combined with different metrics such as visits, pageviews and so on. You can also create your own reports; however, there is no way to create a report that meets our client's requirement.
The client requires a special way to view the data: all the weekly or monthly aggregation data in a left panel, and the visit and pageview numbers for each URL in the right-hand panel, as shown in figure 1 below. In this first part of the article I will talk about the front-end UI, for which I developed a grid-like table with dynamic data. In the second part, I will talk about the implementation using the Google Analytics API, which involves multi-threading to deal with the latency of the service.

In my previous post here, I had been trying to develop a table prototype similar to an Excel sheet. The table has a fixed column on the left side and, on the right-hand side, a fixed header with a scrollable body. See figure 1 below.


figure 1.

The setup is pretty easy. All I need is an outer table with two TDs. Symmetrically, each TD contains two inner tables, the first for the header and the second for the body, each enclosed by a DIV. The table skeleton is shown below:

 

After the HTML table is in place, I align the header and the columns by inserting COLGROUP elements into the tables, depending on how many data columns are retrieved. I will cover that implementation later when I get to the dynamic data retrieval part. Note that all the DOM manipulation is done with jQuery. Using COLGROUP has a disadvantage: some columns may contain long text with no spaces, which makes the column wider than the width specified in the COLGROUP, so the header and the column are no longer aligned properly. To resolve this, I dynamically calculate the table width to make sure enough width is allocated. Note that I can't simply set the width to 100%, because there are far more columns than 100% can fit. I also add the table-layout:fixed style to the table to prevent this from happening; long text with no spaces is then truncated. This attribute does not seem to work well in Firefox, but it can be fixed.


Let's go back to the right-hand side tables. The first table, enclosed by the DIV whose id is “divHeader”, holds the headers. Notice that divHeader has a fixed width of 500 and overflow hidden; this setup allows the header to be scrolled when the fnScroll function is called. The second table, enclosed by the DIV whose id is “table_div”, is the body. Because this DIV has a width, a height and overflow set to scroll, the scroll bars appear whenever the table is larger than the DIV. One more important piece that makes the header and the body columns scroll together is the fnScroll function. The code is shown below.


 fnScroll = function(){
    // keep the fixed header and the fixed left column in step with the scrolling body
    var leftPos = $('#table_div').scrollLeft();
    $('#divHeader').scrollLeft(detectLoadingCell(leftPos));
    $('#firstcol').scrollTop($('#table_div').scrollTop());
}
scrollLeft and scrollTop are jQuery functions; they can take a position as a parameter and also return the current position. Here, leftPos is returned by the table_div element as it scrolls and is passed into divHeader's scrollLeft function, so both divHeader and table_div are scrolled to the same position. The same applies to scrollTop: when the right-hand table scrolls, the left-hand table needs to scroll too. The function detectLoadingCell detects the cells that contain the "Loading…" text; that is the point where the program makes a DWR call to the server to fetch more columns. Here is the code:

function detectLoadingCell(scrollLeftPos)
{
      // no "Loading..." placeholder cells left, nothing to fetch
      if ($('#loading').length == 0)
      {
            return scrollLeftPos;
      }
      // right edge of the cell just before the first loading cell
      var ele = $('#loading').prev();
      var loadingPos = ele.position().left + ele.outerWidth();
      // once the loading cells scroll into view, fetch the next batch of URL columns,
      // but only fire one request per batch
      if (loadingPos < $("#table_div").outerWidth())
      {
            if (!loadOnce)
            {
                  loadOnce = true;
                  // pass the current count of non-loading cells as the offset to fetch from
                  ReportGenerator.appendMoreUrlData($("#table1 td[id!='loading']").length, handleMoreUrlData  );
            }
      }
      return scrollLeftPos;
}
Basically, it detects whether the scroll-left position has reached the position of the loading element. Initially, the program adds 10 loading placeholders if the number of data columns is bigger than the value specified in “# of display urls”. The reason is that thousands of URLs can be viewed for a particular time range, and for each one we need to provide the visit and pageview counts. IE hangs for a long time trying to load it all at once; Firefox is much better, but it still needs some time to render. So I decided to implement a feature I like in Yahoo Mail: dynamically loading data as you scroll. In the next version I will also add paging and searching to improve the UI's usability.

The function ReportGenerator.appendMoreUrlData will be covered later. Basically, it fires an AJAX call, creates table elements such as TDs and COLGROUPs, and appends them to the existing rows.

Let's cover some jQuery functions for adjusting the tables.
There are three main functions, shown below.
fnAdjustTable = function()
{
      adjustTableWidth(totalCol);
      adjustTable('#fixHeader','#fixHeader2','#firstcol','#table_div');
      adjustTable('#fixHeader2','#fixHeader','#table_div','#firstcol');
}
This function is called after the DWR AJAX call completes.
For adjustTableWidth, as I mentioned before, I need to calculate the right-hand-side table width based on how many data columns were retrieved.
function adjustTableWidth(totalColumn)
{
      var tableWidth = (totalColumn / 2 * (smallCol + largeCol)) ;
      $('#table1').css('width',tableWidth);
      $('#table2').css('width',tableWidth);
}
adjustTable() matches the row heights, including the headers, between the left-hand and right-hand tables.
function adjustTable(header1,header2,div1,div2)
{
      var n = 0;
      var colCount = $(header2+'>tr td').length; //get total number of right hand part column
      $(div1+'>table tr ').each(function(i){
                  var height1 = $(this).outerHeight();
                  var height2 = $(div2+' td:eq('+colCount*n+')').outerHeight();
                  if (height1 != null && height2!=null)
                  {
                        if (height1 != height2)
                        {
                               $(this).css('height',height2);
                        }         
                  }
    n++;
  });
  adjustTableHeader(header1,header2);    
}

function adjustTableHeader(header1,header2)
{
      if($(header1).outerHeight()<$(header2).outerHeight())
      {
            $(header1).css("height",$(header2).outerHeight());
      }
}

Alright, now I will talk about using DWR AJAX to download the data and append it to the right-hand table.
The first DWR call happens in the code below:
function update() {
      // gather the report parameters from the form controls
      var param = {};
      param["STARTTIME_PARAM"] = $("#datepicker").val();
      param["ENDTIME_PARAM"] = $("#datepicker2").val();
      param["TESTONLY_PARAM"] = $("#testonly").val();
      param["MAXRESULT_PARAM"] = $("#maxResult").val();
      param["GROUPBY_PARAM_NAME"] = $("input[name='groupbyRadio']:checked").val();
      // reset the scroll state, then fire the DWR call; handleData is the callback
      loadOnce = false;
      $('#table_div').scrollLeft(0);
      ReportGenerator.generateOverViewReport($("#numURL").val(),param, handleData  );
}

The first parameter of the ReportGenerator.generateOverViewReport function is the number of URLs to display. The second parameter is a JavaScript object that is the equivalent of a Java HashMap, and the third is the callback function. Let's take a look at the handleData function. Basically, handleData receives a ReportData Java object that DWR serializes into a JavaScript object; ReportData contains the aggregate weekly/monthly data, the total aggregation data, and the URL data.
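Before looking at handleData itself, here is a rough sketch of what the DWR-exposed server side behind these calls might look like. Only the names ReportGenerator, generateOverViewReport, appendMoreUrlData, aggregateData and urlDisplayData come from this project; the types and signatures are my own guesses, not the project's source.

import java.util.List;
import java.util.Map;

// Hypothetical shape of the data object; DWR serializes the returned ReportData
// into the JavaScript object that the handleData callback receives.
class ReportData {
    private List<List<String>> aggregateData;   // weekly/monthly aggregation (left-hand table)
    private List<List<String>> urlDisplayData;  // per-URL visits/pageviews (right-hand table)
    private List<String> totalData;             // overall totals
    // getters and setters omitted
}

public class ReportGenerator {
    // Backs ReportGenerator.generateOverViewReport(numURL, param, handleData) in the browser.
    public ReportData generateOverViewReport(int numberOfUrls, Map<String, String> params) {
        // query the Analytics feed, aggregate by week or month, and fill a ReportData
        return null; // placeholder
    }

    // Backs ReportGenerator.appendMoreUrlData(offset, handleMoreUrlData); offset is the
    // index of the first URL column that has not been fetched yet.
    public List<List<String>> appendMoreUrlData(int offset) {
        return null; // placeholder
    }
}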
Here is the handleData code:



Lines 305–308 use the DWR utility to delete all rows. Lines 309–310 get the total number of data columns and rows. Line 311 populates the left-hand table with the aggregateData in the data object. Line 312 populates the right-hand table with the urlDisplayData inside the data object. Line 313 prints out the totals. Line 314 adjusts the table row heights as mentioned before.
Here is the method populateAggregateColumn. Notice that I did not use jQuery to get the tbody and thead; I use the plain DOM API to populate all the table rows and columns. The reason is that jQuery is almost 2 times slower. My machine has 4 GB of RAM and a dual-core CPU running XP. For example, for an operation that creates COLGROUP, TR and TD elements and appends them all to a 39 x 105 table, jQuery takes 6.5 seconds to finish; the DOM API takes only 3.8. However, the DOM API approach uses almost 10 MB more memory than jQuery. I also tried cloneNode from the DOM API, but it made little difference.



Line 343 populates the table headers. Line 344 inserts COLGROUP elements into the table. Lines 345–348 append, for each row of data records, a newly created TR with all its data TDs. The appendColumns function in line 347 is reused by the populateHeaders function. Finally, line 349 populates the rest of the blank column grids.
Here is the method appendColumns. In line 401, it assigns the id “loading” to the TD element if the value equals “loading…”. As mentioned before, the server automatically adds 10 loading values to the records if the number of displayed URLs is less than the total record count.




Here is the method populateBlankColumns. This function is also called from handleMoreUrlData: when handleMoreUrlData finishes filling in the fetched data, we also need to add blank columns to the remaining rows. The rowStart parameter is the starting point of the existing blank rows that need more blank columns appended.
When the user scrolls the right-hand table, the ReportGenerator.appendMoreUrlData function is called from the detectLoadingCell method, as mentioned before. This is a DWR AJAX call. The first parameter is the record offset, i.e. the starting point for loading more data; the second parameter, handleMoreUrlData, is the callback function.

Here is the code for handleMoreUrlData

Line 114 checks whether the returned data is null; if it is, there is no more data to fetch, so it cleans up.

Here is the code for removeLoadingTds


jQuery provides an easy way to get the index of an element and remove it; the selectors are very effective and powerful.
Line 97 calculates the index of the loading TD.
Lines 98 and 99 remove all the loading TDs. Lines 100 and 101 remove all the loading COLGROUPs. Lines 102–104 remove the extra blank columns from the remaining rows. Line 105 adjusts the table width.

insertUrlData appends the newly created COLGROUPs and TDs containing the fetched data to the existing data rows.

Here is the code for insertUrlData


Two points: first, line 139 ignores values that are “loading…” or empty. Second, at the end it stores the number of columns inserted and the number of existing data rows into an object, which is returned so that the next function, populateBlankColumns, has row and column numbers to start from.

So far I have covered the implementation of this table in great detail. The tab control you see in figure 1 comes from the Adobe UI framework, which I have been using for my other projects. The UI also displays a chart to the user. There are two types of chart for now: a horizontal bar chart for the weekly report and a pie chart for the monthly report, as shown in figures 2 and 3 below.



That is it for now. You can find my part 2 post here.