Tuesday, June 1, 2010

Google Analytic Project (Part 2) - API and Multi-thread process

Google Analytics provides APIs that let developers download web statistics programmatically. The API technical documentation is here: http://code.google.com/apis/analytics/docs/gdata/1.0/gdataProtocol.html
You need a Google account to access Analytics. You set up each website that you want to analyze and put the generated JavaScript code into your pages; one Analytics account can hold multiple websites. Once the setup is correct, Google Analytics starts collecting data for your website. To access the data through the API, you specify the table ID that was generated when you set up the website and pass it along with your login account and password to connect to the service. Let’s look at the following sample from Google:

The above code makes two connections to Google.
Line 88 creates a new AnalyticsService that is used later to fetch the data feed. Line 90 is where it actually connects to the Google service with your login ID and password. Later, at line 105, it connects to the service again to get the data feed. After noticing this behavior, I found that we can log in once and then reuse the same AnalyticsService object to call getFeed multiple times. Line 93 creates a new DataQuery object with the URL, and lines 94 to 100 set all the parameters. The TABLE_ID and metrics are mandatory. The returned data feed contains all the dimensions and metrics you specified; in this case, its entries contain “ga:week, ga:pagePath, ga:date, ga:visits, ga:pageviews, ga:timeOnPage”. Line 105 is the point where the data feed is downloaded from Google and stored in your variable.
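
As a rough, dependency-free illustration of what those query parameters amount to on the wire, here is a small sketch that assembles a data feed URL by hand. The endpoint and parameter names are the ones described on the protocol page linked above; the class name DataFeedUrl, the table id ga:12345 and the dates are hypothetical placeholders, and this is not the client library's own code.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, for illustration only (requires Java 10+ for URLEncoder with a Charset).
final class DataFeedUrl {
    // Data feed endpoint of the Data Export API 1.0.
    static final String BASE = "https://www.google.com/analytics/feeds/data";

    static String build(String tableId, String dimensions, String metrics,
                        String startDate, String endDate,
                        int startIndex, int maxResults) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("ids", tableId);
        params.put("dimensions", dimensions);
        params.put("metrics", metrics);
        params.put("start-date", startDate);
        params.put("end-date", endDate);
        params.put("start-index", Integer.toString(startIndex));
        params.put("max-results", Integer.toString(maxResults));

        StringJoiner query = new StringJoiner("&", BASE + "?", "");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Placeholder table id and date range.
        System.out.println(build("ga:12345", "ga:week,ga:pagePath,ga:date",
                "ga:visits,ga:pageviews,ga:timeOnPage",
                "2010-05-01", "2010-05-31", 1, 1000));
    }
}
```

Only start-index changes between the requests needed to page through one result set, which is why the rest of the query can be set up once.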

Google imposes some quota restrictions. The default max-results is 1000, meaning that up to 1000 records are downloaded per request, and you can specify no more than 10000. The more records you request, the slower the response. The service also has segment and filter parameters to narrow down the data. Keep in mind, however, that you may have more than 10000 records per day, depending on your dimension and metric parameters. So how do you download the rest after the first 10000? Fortunately, the service has a start-index parameter. By combining the start index with the maximum result number, you can download all the records.

Another question is how to download all of them at once, if I need to see them nicely presented in a table or view them in an XLS sheet. To do that, you create multiple instances that each download part of the data feed, and you store the data somewhere after each download, so that when all the downloads finish you have all the data. You also need to know the total number of records in order to know when to stop downloading. Fortunately, the DataFeed class has a method named getTotalResults() for this.

As I mentioned, larger requests are slower: downloading 10000 records takes a little over 30 seconds on the first run. Google does cache these records, so re-running the same query with a different record count (say 2000 this time) is faster, but this does not hold for a different start index. The 30 seconds also varies with time and place. What I am going to show you is how to use multi-threading to download all the data at once and speed up the process.
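
To make the paging arithmetic concrete, here is a minimal sketch that lists the start-index values needed to cover a result set. The class and method names are hypothetical; the total of 2330 is the figure from the sample run discussed later in this post.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class Paging {
    /** Start indices (1-based, as in the feed) needed to cover totalResults records. */
    static List<Integer> startIndices(int totalResults, int maxResults) {
        if (maxResults < 1) {
            throw new IllegalArgumentException("maxResults must be positive");
        }
        List<Integer> starts = new ArrayList<>();
        for (int start = 1; start <= totalResults; start += maxResults) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(startIndices(2330, 1000));       // [1, 1001, 2001]
        System.out.println(startIndices(2330, 200).size()); // 12
    }
}
```

With 1000 records per request the 2330 records need three requests; with 200 per request they need twelve, so a pool of 8 threads has to go back for a second round.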

Using multi-threading for this application raises several design considerations. The most important is to assign the right start index to each thread in the thread pool, so that it downloads and stores the right slice of records. The number of threads that need to run depends on the max-results value mentioned above, which is configurable. Secondly, Google’s DataQuery and DataFeed need to be treated as local to each thread, so that each thread uses its own query to download and its own DataFeed to fill in the table. To achieve maximum concurrency, we shouldn’t synchronize the whole transaction from download to storing the data; the synchronized block should be kept as small as possible. The start index is essential: it must be handed out under mutual exclusion and then treated as thread-local. Therefore, we synchronize only on obtaining the start index, and inside that block we put the start index into either a ThreadLocal variable or a ConcurrentMap. Here is the snapshot,
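
Independently of the snapshot referred to here, the core idea (synchronize only while claiming a start index, then download and store outside the lock) can be sketched in plain Java. Everything in this sketch is an assumption made for illustration: the class and method names are hypothetical, the download is simulated, and the claimed index is kept in a local variable instead of the ThreadLocal or ConcurrentMap described above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the "download" just fabricates row labels.
final class ConcurrentDownload {
    private final Object lock = new Object();
    private int nextStart = 1;        // shared counter, guarded by lock
    private final int totalResults;
    private final int maxResults;
    private final String[][] table;   // one row per record

    ConcurrentDownload(int totalResults, int maxResults, int columns) {
        this.totalResults = totalResults;
        this.maxResults = maxResults;
        this.table = new String[totalResults][columns];
    }

    /** Claims the next start index, or returns -1 when nothing is left. */
    private int claimStartIndex() {
        synchronized (lock) {         // the only synchronized region
            if (nextStart > totalResults) {
                return -1;
            }
            int claimed = nextStart;
            nextStart += maxResults;
            return claimed;
        }
    }

    /** Stand-in for the real download; fills this slice's rows outside the lock. */
    private void downloadAndStore(int start) {
        int end = Math.min(start + maxResults - 1, totalResults);
        for (int row = start; row <= end; row++) {
            table[row - 1][0] = "record-" + row;   // feed is 1-based, array is 0-based
        }
    }

    String[][] run(int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                int start;            // local, so private to this worker
                while ((start = claimStartIndex()) != -1) {
                    downloadAndStore(start);
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("downloads did not finish in time");
        }
        return table;
    }

    public static void main(String[] args) throws InterruptedException {
        String[][] rows = new ConcurrentDownload(2330, 200, 6).run(8);
        System.out.println(rows[0][0] + " ... " + rows[rows.length - 1][0]);
    }
}
```

Because every claimed slice covers a disjoint range of rows, the workers can write into the shared array without any further locking.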

The synchronized block runs from line 161 to line 174. Mutex is an integer that is increased by the max-results number each time. Line 172 stores the mutex as the start index; behind the scenes, setStartIndex puts the value into a ThreadLocal variable so that each thread can retrieve its own start index. Line 173 increases the mutex to the next start index value. Lines 163 to 171 check whether the start index (mutex) is already greater than the total number of records; if so, the thread performs some clean-up, exits the while loop, and ends.

You have probably noticed that I clean up the maps for feedLocal and threadLocalQuery. As I mentioned before, DataFeed and DataQuery need to be treated as thread-local. I originally used ThreadLocal variables for them, but memory did not seem to be managed as effectively as with a ConcurrentMap. I could instead have used separate GlicAnalyticQuery instances, each wrapping its own DataQuery and DataFeed, and then I wouldn’t need to put them into a map. The reason for reusing the same GlicAnalyticQuery object is that it may be big, in the sense that it may create many other objects such as result sets, services, or even connections, and you don’t want to create several of those. This is similar to the Flyweight pattern. Holding the values in a map serves the same purpose as ThreadLocal: each thread gets its own value. ThreadLocal uses a map internally too.

Line 177 is where the download and storing are performed concurrently by the different threads. Internally, it uses a service to execute the query, and after the download finishes the data is copied into the two-dimensional array. The starting row index in the array corresponds to each thread’s start index, and the number of rows stored equals the max-results number.
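
The map-based alternative to ThreadLocal can be sketched as follows. This is only an assumption about the general shape, not the actual feedLocal or threadLocalQuery code; the class name PerThreadState is hypothetical, and the map is keyed by the current thread.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical holder: one value per thread, with explicit clean-up.
final class PerThreadState<T> {
    private final ConcurrentMap<Thread, T> byThread = new ConcurrentHashMap<>();

    void set(T value) { byThread.put(Thread.currentThread(), value); }

    T get() { return byThread.get(Thread.currentThread()); }

    /** Removes the calling thread's entry; call this when the thread is done. */
    void clear() { byThread.remove(Thread.currentThread()); }

    int size() { return byThread.size(); }

    public static void main(String[] args) throws InterruptedException {
        PerThreadState<String> query = new PerThreadState<>();
        Runnable worker = () -> {
            query.set("query owned by " + Thread.currentThread().getName());
            try {
                System.out.println(query.get());   // each thread sees only its own value
            } finally {
                query.clear();                     // clean up before the thread ends
            }
        };
        Thread a = new Thread(worker, "thread-0");
        Thread b = new Thread(worker, "thread-2");
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("entries left: " + query.size());
    }
}
```

The explicit clear() plays the role of the clean-up step described above: with pooled threads that outlive the task, entries that are never removed stay reachable, whether they live in a map like this or in a ThreadLocal.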
Below is the screenshot for the output:

The main thread makes a dummy call to the Google service to get the total result count; as you can see, its max-results is only 1. In this case totalResults equals 2330, so a two-dimensional array with 2330 rows is created for storage. The main thread finishes in 3.483 seconds. Next, you see three threads running concurrently: threads 0, 2 and 4. From the URL output, thread 0 has start-index 1, thread 2 has start-index 1001, and thread 4 has start-index 2001, all with max-results 1000. Thread 4 finishes in 2.998 seconds, thread 0 in 3.451 seconds, and thread 2 in 3.592 seconds. The last line is the total time from the start of the main thread to the end of the last thread to finish, 7.106 seconds. It should be approximately the time of the slowest thread (3.592 seconds) plus the main thread (3.483 seconds), which is 7.075 seconds.

Now suppose we set the maximum result number to 200. All 8 threads in the pool then start concurrently, and when one finishes it picks up the next start index. Here is the output snapshot; you should see threads 6, 2 and 1 getting the next start index after they finish.

I believe the Google service also has some restrictions on concurrent downloads. Here is the link
Basically, you can’t issue more than 10 requests within a second, which is why I only put 8 threads in the pool. There is also a limit of 4 pending requests at any given time (i.e. you must wait until your 1st request completes before making a 5th request).
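
One way to respect the limit of 4 pending requests while still keeping 8 threads in the pool is to guard each request with a counting semaphore. The sketch below is an assumption about how that could be done, not something the code in this post does; the class name is hypothetical, the requests are simulated with a short sleep, and it does not address the per-second limit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical limiter, for illustration only.
final class PendingRequestLimiter {
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    PendingRequestLimiter(int maxPending) {
        this.permits = new Semaphore(maxPending);
    }

    /** Runs one request, blocking first if maxPending requests are already pending. */
    void execute(Runnable request) throws InterruptedException {
        permits.acquire();
        try {
            int now = inFlight.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);   // remember the highest concurrency seen
            request.run();
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }

    int peakPending() { return peak.get(); }

    /** Pushes simulated requests through a pool and returns the peak number pending. */
    static int simulate(int threads, int requests, int maxPending) throws InterruptedException {
        PendingRequestLimiter limiter = new PendingRequestLimiter(maxPending);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < requests; i++) {
            pool.execute(() -> {
                try {
                    limiter.execute(() -> sleepQuietly(20));   // stand-in for a feed download
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return limiter.peakPending();
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("peak pending requests: " + simulate(8, 12, 4));
    }
}
```

The pool size and the number of permits are independent here: extra threads simply wait for a permit, so the number of requests in flight never exceeds the permit count.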

In the next post, I will talk about using DWR to load the data into the table that I have created in this post.