<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://whijo.net" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>statistics</title>
 <link>http://whijo.net/taxonomy/term/63/feed</link>
 <description>The taxonomy view with a depth of 0.</description>
 <language>en</language>
<item>
 <title>On Internet statistics</title>
 <link>http://whijo.net/blog/brad/2008/07/23/internet-statistics.html</link>
 <description>&lt;p&gt;Because I am on the internet, and because the apps I use offer statistics, I figured I would share some of them. Feel free to do the same if you feel the urge.&lt;/p&gt;
&lt;!--break--&gt;



&lt;h3&gt;&lt;a href=&quot;http://www.google.com/reader/shared/09277211261723009362&quot;&gt;Google Reader&lt;/a&gt;&lt;/h3&gt;
&lt;div id=&quot;trends-item-count-header&quot;&gt;From your &lt;b&gt; 157  subscriptions&lt;/b&gt;, over the last 30 days &lt;b&gt;you read  1,698  items&lt;/b&gt;, &lt;b&gt;starred  17  items&lt;/b&gt;, &lt;b&gt;shared  48  items&lt;/b&gt;, and &lt;b&gt;emailed  0  items&lt;/b&gt;.&lt;/div&gt;



&lt;h3&gt;&lt;a href=&quot;http://www.last.fm/user/d-arb&quot;&gt;Last.fm&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;16,205 plays since 8 Aug 2006&lt;/p&gt;
&lt;table class=&quot;lfmWidgetchart_f944d7891eecea7d507dd547ac225ef2&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; border=&quot;0&quot; style=&quot;width:184px;&quot;&gt;&lt;tr class=&quot;lfmEmbed&quot;&gt;&lt;td&gt;&lt;object type=&quot;application/x-shockwave-flash&quot; data=&quot;http://cdn.last.fm/widgets/chart/19.swf&quot; codebase=&quot;http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=7,0,0,0&quot; id=&quot;lfmEmbed_988697581&quot; width=&quot;184&quot; height=&quot;140&quot;&gt; &lt;param name=&quot;movie&quot; value=&quot;http://cdn.last.fm/widgets/chart/19.swf&quot; /&gt; &lt;param name=&quot;flashvars&quot; value=&quot;type=topartists&amp;amp;user=d-arb&amp;amp;theme=red&amp;amp;lang=en&amp;amp;widget_id=chart_f944d7891eecea7d507dd547ac225ef2&quot; /&gt; &lt;param name=&quot;allowScriptAccess&quot; value=&quot;always&quot; /&gt; &lt;param name=&quot;allowNetworking&quot; value=&quot;all&quot; /&gt; &lt;param name=&quot;allowFullScreen&quot; value=&quot;true&quot; /&gt; &lt;param name=&quot;quality&quot; value=&quot;high&quot; /&gt; &lt;param name=&quot;bgcolor&quot; value=&quot;d01f3c&quot; /&gt; &lt;param name=&quot;wmode&quot; value=&quot;transparent&quot; /&gt; &lt;param name=&quot;menu&quot; value=&quot;true&quot; /&gt; &lt;/object&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;



&lt;h3&gt;&lt;a href=&quot;http://twitter.com/darb&quot;&gt;Twitter&lt;/a&gt;&lt;/h3&gt;
&lt;ul class=&quot;stats&quot;&gt;
  &lt;li&gt;&lt;span class=&quot;label&quot;&gt;&lt;a href=&quot;/friends&quot; rel=&quot;me&quot;&gt;Following&lt;/a&gt;&lt;/span&gt; &lt;span id=&quot;followingcount&quot; class=&quot;stats_count numeric&quot;&gt;30&lt;/span&gt;&lt;/li&gt;
  &lt;li&gt;&lt;span class=&quot;label&quot;&gt;&lt;a href=&quot;/followers&quot; rel=&quot;me&quot;&gt;Followers&lt;/a&gt;&lt;/span&gt; &lt;span id=&quot;follower_count&quot; class=&quot;stats_count numeric&quot;&gt;51&lt;/span&gt;&lt;/li&gt;
  &lt;li&gt;&lt;span class=&quot;label&quot;&gt;&lt;a href=&quot;/favorites&quot; rel=&quot;me&quot;&gt;Favorites&lt;/a&gt;&lt;/span&gt; &lt;span id=&quot;favourite_count&quot; class=&quot;stats_count numeric&quot;&gt;0&lt;/span&gt;&lt;/li&gt;
  &lt;li&gt;&lt;span class=&quot;label&quot;&gt;&lt;a href=&quot;/direct_messages&quot;&gt;Direct Messages&lt;/a&gt;&lt;/span&gt; &lt;span id=&quot;message_count&quot; class=&quot;stats_count numeric&quot;&gt;5&lt;/span&gt;&lt;/li&gt;
  &lt;li&gt;&lt;span class=&quot;label&quot;&gt;&lt;a href=&quot;/account/archive&quot;&gt;Updates&lt;/a&gt;&lt;/span&gt; &lt;span id=&quot;update_count&quot; class=&quot;stats_count numeric&quot;&gt;547&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
 <comments>http://whijo.net/blog/brad/2008/07/23/internet-statistics.html#comments</comments>
 <category domain="http://whijo.net/tags/geek">geek</category>
 <category domain="http://whijo.net/tags/google-reader">google reader</category>
 <category domain="http://whijo.net/tags/last-fm">last.fm</category>
 <category domain="http://whijo.net/tags/statistics">statistics</category>
 <category domain="http://whijo.net/geek-tags/statistics">statistics</category>
 <pubDate>Wed, 23 Jul 2008 10:05:35 +0200</pubDate>
 <dc:creator>brad</dc:creator>
 <guid isPermaLink="false">444 at http://whijo.net</guid>
</item>
<item>
 <title>Statistics logging for Django - part 2</title>
 <link>http://whijo.net/blog/brad/2007/07/29/statistics-logging-django-part-2.html</link>
 <description>In &lt;a href=&quot;http://whijo.net/blog/brad/2007/07/19/statistics-logging-django.html&quot;&gt;part 1&lt;/a&gt; I explained how to build middleware and an associated model to capture page accesses and tie them to a user session. Now that we have all this useful info logged, we need to do something with it, like display it. Unfortunately Django doesn&#039;t have a facility for using GROUP BY with MySQL, so you have two major choices (there are more, but we can ignore them): implement a custom query in a &lt;a href=&quot;http://www.djangoproject.com/documentation/model-api/#managers&quot;&gt;custom Manager&lt;/a&gt; (see &lt;a href=&quot;http://www.djangosnippets.org/snippets/236/&quot;&gt;snippet&lt;/a&gt; and &lt;a href=&quot;http://www.djangosnippets.org/snippets/1/&quot;&gt;snippet&lt;/a&gt;, or &lt;a href=&quot;http://www.djangosnippets.org/tags/group-by/&quot;&gt;tagged snippets&lt;/a&gt;), or exploit a &lt;a href=&quot;&quot;&gt;MySQL view&lt;/a&gt; and model it in Django. I prefer the latter because my custom SQL becomes a MySQL customisation, and as far as Django is concerned it is dealing with a normal table (just don&#039;t tell Django that it is read-only). The model code works as usual, so subsequent queries and manipulations can exploit the &lt;acronym title=&quot;Object-Relational Mapper&quot;&gt;ORM&lt;/acronym&gt; easily. My subjective and non-scientific experience is that using views is a lot quicker than using custom queries in the manager (it probably has to do with whatever optimisations exist for views, and the fact that rows are only fetched when Django decides it needs them). So, how the hell do we do it?
&lt;!--break--&gt;
First I created a model that describes what information I want to deal with (something which maps neatly on to our other model):
&lt;pre&gt;&lt;code&gt;class UserActivity(models.Model):
        session = models.OneToOneField(Session,
                                        db_index=True, 
                                        null=True,blank=True,
                                        primary_key=True)
        user = models.ForeignKey(User,null=True,blank=True)
        date = models.DateTimeField(
                       help_text=&quot;Date Request started processing&quot;,
                       auto_now_add=True,
                       db_index=True)
        processing_time = models.IntegerField(
                       help_text=&quot;Total time spent on this user&quot;)
        requests = models.IntegerField(
                       help_text=&quot;Total Requests in this session&quot;)
        stats = UserActivityManager()
        def __str__(self):
                return &#039;%s: %s %s - %s - %s&#039; % (self.user,self.session,self.date,self.processing_time,self.requests)
        class Admin:
                list_display= (&#039;user&#039;,&#039;session&#039;,&#039;date&#039;,&#039;processing_time&#039;,&#039;requests&#039;)&lt;/code&gt;&lt;/pre&gt;

The nice thing about this set-up is that when we aggregate our activity logs we can pull out things like the total processing time per user/session, along with the number of requests per user/session (and thus the average request time).

But that is just our model; we still need the magic. To implement the magic nicely I put some custom initial SQL into the sql directory of my application (in my case the housing application is called accounts, so I make a file called accounts/sql/useractivity.sql). You can read more about initial data &lt;a href=&quot;http://www.djangoproject.com/documentation/model-api/#providing-initial-sql-data&quot;&gt;here&lt;/a&gt; and under &lt;a href=&quot;http://www.djangoproject.com/documentation/models/fixtures/&quot;&gt;Django fixtures&lt;/a&gt;. My SQL looks like this:
&lt;pre&gt;&lt;code&gt;DROP TABLE accounts_useractivity;
CREATE OR REPLACE VIEW accounts_useractivity AS 
SELECT i.session_id,
       i.user_id,
       MAX(i.date) as date,
       sum(i.request_time) AS processing_time, 
       count(*) AS requests 
FROM accounts_activitylog i 
GROUP BY 1 
ORDER BY NULL;
&lt;/code&gt;&lt;/pre&gt;
So first I tell MySQL to drop the table that Django just created (accounts_useractivity) and create a view in its place. The view is very simple: it just groups the log rows by session_id. The real hair-puller for me was figuring out that I needed MAX(i.date) (see more about &lt;a href=&quot;http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html&quot;&gt;aggregate functions&lt;/a&gt;) so that each grouped row carries the most recent access for its session. By default GROUP BY also sorts by the grouping column (session_id here), which helps no one; the ORDER BY NULL is &lt;a href=&quot;http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html&quot;&gt;an optimisation&lt;/a&gt; that tells GROUP BY not to sort. I am hoping that because date is indexed (from our logging model) it shouldn&#039;t cost too much to do a MAX. (I would like someone with much MySQL-fu to point out any further optimisations to this, or even alternative approaches to the whole thing.)
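
To see what the view produces without a MySQL server to hand, here is a minimal sketch using the sqlite3 module from the Python standard library. It is an assumption-laden stand-in rather than the set-up described above: SQLite replaces MySQL, the three sample rows are made up, and the ORDER BY NULL hint is left out because it is a MySQL-specific optimisation.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL one (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts_activitylog (
    session_id TEXT,
    user_id INTEGER,
    date TEXT,
    request_time INTEGER
);
INSERT INTO accounts_activitylog VALUES
    ('s1', 1, '2007-07-29 10:00:00', 120),
    ('s1', 1, '2007-07-29 10:05:00', 80),
    ('s2', 2, '2007-07-29 09:00:00', 200);
CREATE VIEW accounts_useractivity AS
SELECT session_id,
       user_id,
       MAX(date) AS date,
       SUM(request_time) AS processing_time,
       COUNT(*) AS requests
FROM accounts_activitylog
GROUP BY session_id;
""")

# One row per session, most recently active session first.
rows = conn.execute(
    "SELECT session_id, date, processing_time, requests "
    "FROM accounts_useractivity ORDER BY date DESC"
).fetchall()
for row in rows:
    print(row)
```

Each session collapses to a single row carrying its latest access time, its summed request time and its request count, which is the shape the UserActivity model expects.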

So now we have an aggregating VIEW which Django maps using its ORM. To figure out which sessions have been active in the last x minutes (where x is a datetime.timedelta object) we simply go through the model&#039;s manager (stats in the model above):
 &lt;pre&gt;&lt;code&gt;UserActivity.stats.filter(date__gte=datetime.now()-x)&lt;/code&gt;&lt;/pre&gt;

I wrote a custom manager for getting recent sessions etc., but that is an exercise for the reader. What I did include in my model is something which returns a stepped &quot;request_weight&quot;, i.e. (requests in this session / requests in the busiest session) x steps, where steps defaults to 6 in my case. This means I can style my users like one would a &quot;&lt;a href=&quot;http://en.wikipedia.org/wiki/Tag_cloud&quot;&gt;tag cloud&lt;/a&gt;&quot;, so very active sessions grow bigger than less active ones. For this I needed a helper function in the custom manager that returns the session with the most requests.
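
The stepped weight can be sketched as a plain function. This is only an illustration of the idea, not the code from my model: the name request_weight and its signature are assumptions, the result is rounded up so that every session lands on at least step 1, and it is written for a current Python 3.

```python
def request_weight(session_requests, max_requests, steps=6):
    """Map a session's request count onto a 1..steps scale.

    max_requests is the request count of the busiest session.
    """
    if not max_requests:
        # No traffic at all yet: everything gets the smallest step.
        return 1
    # Integer ceiling of (session_requests / max_requests) * steps.
    step = (session_requests * steps + max_requests - 1) // max_requests
    # Clamp into the valid range in case of odd input.
    return max(1, min(steps, step))


for count in (1, 10, 25, 50):
    print(count, request_weight(count, 50))
```

In a template the returned step can be appended to a CSS class name (a hypothetical weight-1 to weight-6), exactly as one would do for tag cloud font sizes.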

The final tip is to use a &lt;a href=&quot;http://www.djangoproject.com/documentation/templates_python/#subclassing-context-requestcontext&quot;&gt;context processor&lt;/a&gt; to make the information available to all your templates, although you could do it with middleware (maybe middleware is the proper way to do it?).</description>
 <comments>http://whijo.net/blog/brad/2007/07/29/statistics-logging-django-part-2.html#comments</comments>
 <category domain="http://whijo.net/geek-tags/django">django</category>
 <category domain="http://whijo.net/tags/geek">geek</category>
 <category domain="http://whijo.net/geek-tags/middleware">middleware</category>
 <category domain="http://whijo.net/geek-tags/mysql">mysql</category>
 <category domain="http://whijo.net/geek-tags/mysql-views">mysql views</category>
 <category domain="http://whijo.net/geek-tags/python">python</category>
 <category domain="http://whijo.net/geek-tags/statistics">statistics</category>
 <pubDate>Sun, 29 Jul 2007 21:52:25 +0200</pubDate>
 <dc:creator>brad</dc:creator>
 <guid isPermaLink="false">110 at http://whijo.net</guid>
</item>
<item>
 <title>Statistics logging for Django</title>
 <link>http://whijo.net/blog/brad/2007/07/19/statistics-logging-django.html</link>
 <description>Last night I built some middleware/models for a Django application to log visitor/user activity on the site. The intention is to be able to do better user tracking, and build more comprehensive statistics stored in the MySQL DB (obviously I am also logging everything with Apache). The current set-up still needs some periodic scripts to roll the logged data up into statistics. I was thinking of doing a daily-weekly-monthly routine (i.e. once a day yesterday&#039;s log entries are rolled up into daily stats, once a week those are turned into weekly stats, and once a month they are condensed into a monthly overview). It was actually really simple to implement, but I butted my head against some Django issues (more at the end).
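
The daily-weekly routine could look roughly like the sketch below. Nothing here exists in the application yet: the function names are hypothetical, the input is just a list of request timestamps rather than model instances, and it is written for a current Python 3.

```python
from collections import Counter
from datetime import datetime


def daily_counts(timestamps):
    """Count requests per calendar day."""
    return Counter(ts.date() for ts in timestamps)


def weekly_counts(daily):
    """Fold per-day counts into (ISO year, ISO week) buckets."""
    weekly = Counter()
    for day, count in daily.items():
        year, week, _weekday = day.isocalendar()
        weekly[(year, week)] += count
    return weekly


sample = [
    datetime(2007, 7, 16, 9, 30),
    datetime(2007, 7, 19, 12, 5),
    datetime(2007, 7, 19, 18, 45),
    datetime(2007, 7, 23, 8, 0),
]
per_day = daily_counts(sample)
per_week = weekly_counts(per_day)
print(per_day)
print(per_week)
```

A monthly overview would fold the same daily counts by (year, month) in the same way, after which the raw log rows for that period could be archived.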

So, first we build a model to represent a request:
&lt;pre&gt;&lt;code&gt;
class UserActivity(models.Model):
        user = models.ForeignKey(
                      User,
                      null=True, blank=True,
                      db_index=True
               )
        session = models.ForeignKey(
                      Session,
                      db_index=True,
                      null=True, blank=True
                  )
        date = models.DateTimeField(
                      help_text=&quot;Date Request started processing&quot;,
                      auto_now_add=True,
                      db_index=True)
        request_time = models.IntegerField(
                              help_text=&quot;Processing time (in ms)&quot;,
                              null=True, blank=True)
        request_url = models.CharField(maxlength=800,db_index=True)
        referer_url = models.URLField(
                              verify_exists=False,
                              db_index=True,
                              blank=True, null=True)
        client_address = models.IPAddressField(
                              blank=True,null=True)
        client_host = models.CharField(
                              maxlength=256,
                              blank=True,null=True)
        browser_info = models.TextField(null=True,blank=True)
        error = models.TextField(null=True,blank=True)
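        def set_request_time(self):
                # Store the elapsed time in whole milliseconds. Note that
                # timedelta.microseconds only holds the sub-second part.
                from datetime import datetime
                delta = datetime.now() - self.date
                self.request_time = (
                        (delta.days * 86400 + delta.seconds) * 1000 +
                        delta.microseconds // 1000
                )
                self.save()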
&lt;/code&gt;&lt;/pre&gt;
(download &lt;a href=&quot;http://whijo.net/files/models.py_.txt&quot; title=&quot;Download: models.py_.txt (1.17 KB)&quot;&gt;models.py_.txt&lt;/a&gt;)

I think the model captures all the relevant info: we tie a request to a session and user, we have the time they made the request (and using middleware we can calculate how long the request took), the referer, and some info about the client.

Most of the fields can be blank/null because we are not always going to have a session (see below), etc.

The function set_request_time is called by the outgoing middleware function (process_response) and just notes how long the request took, and saves the object.
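
One thing to watch with timing code like set_request_time: the microseconds attribute of a timedelta holds only the sub-second part of the interval, so a figure in milliseconds has to combine days, seconds and microseconds. A minimal sketch of that arithmetic, where the helper name elapsed_ms is hypothetical:

```python
from datetime import timedelta


def elapsed_ms(delta):
    """Convert a timedelta into whole milliseconds."""
    whole_seconds = delta.days * 86400 + delta.seconds
    return whole_seconds * 1000 + delta.microseconds // 1000


slow_request = timedelta(seconds=2, microseconds=350000)
print(slow_request.microseconds)  # sub-second part only
print(elapsed_ms(slow_request))   # full duration in milliseconds
```

For a request that takes under a second the two values differ only by a factor of 1000, which is why the difference is easy to miss until a slow request shows up.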

Next we need some middleware to handle the object creation:
&lt;pre&gt;&lt;code&gt;
from datetime import datetime
from django.conf import settings
from my_app.models import UserActivity

class Activity(object):
        def process_request(self,request):
                # Keep the log entry on the request rather than on the
                # middleware instance, which is shared between requests.
                request.activity = UserActivity(
                        user = request.user,
                        session = request.session,
                        date = datetime.now(),
                        request_url = request.META[&#039;PATH_INFO&#039;],
                        # These headers are optional, so fall back quietly.
                        referer_url = request.META.get(&#039;HTTP_REFERER&#039;, &#039;&#039;),
                        client_address = request.META.get(&#039;REMOTE_ADDR&#039;),
                        client_host = request.META.get(&#039;REMOTE_HOST&#039;),
                        browser_info = request.META.get(&#039;HTTP_USER_AGENT&#039;)
                )

        def process_exception(self,request,exception):
                # error is a text field, so store the message explicitly.
                request.activity.error = str(exception)
                request.activity.save()

        def process_response(self,request,response):
                # The only write on the normal path: time the request and save.
                request.activity.set_request_time()
                return response
&lt;/code&gt;&lt;/pre&gt;
(download &lt;a href=&quot;http://whijo.net/files/middleware.py_.txt&quot; title=&quot;Download: middleware.py_.txt (825 bytes)&quot;&gt;middleware.py_.txt&lt;/a&gt;)

You may (or may not) have noticed that we only actually save our model on the outgoing response, so we only have one db write per request. The middleware system is very easy to build for, and is &lt;a href=&quot;http://www.djangoproject.com/documentation/middleware/&quot;&gt;documented here&lt;/a&gt;. The nice thing is the process_exception will keep a record of the exception (but I am not sure if this could be done so it stores more information than just the exception.__str__()?)
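
On the question of storing more than the exception message: the traceback module in the standard library can render the full stack trace as text, which would fit in the error field as it is. A minimal sketch for a current Python 3, where describe_exception is a hypothetical helper and not part of the middleware above:

```python
import traceback


def describe_exception(exc):
    """Return the exception type, message and stack trace as one string."""
    lines = traceback.format_exception(type(exc), exc, exc.__traceback__)
    return "".join(lines)


try:
    {}["missing"]
except KeyError as exc:
    report = describe_exception(exc)

print(report)
```

Inside process_exception the same call could be made on the exception argument, at the cost of a noticeably larger text value per failed request.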

To install this, your model needs to live in an app that is &quot;installed&quot; and has been through &quot;syncdb&quot;. The middleware needs to be placed after the session middleware (and after the authentication middleware, since it reads request.user), e.g. in settings.py (in MIDDLEWARE_CLASSES):
&lt;pre&gt;&lt;code&gt;    
    &quot;django.middleware.common.CommonMiddleware&quot;,
    &quot;django.contrib.sessions.middleware.SessionMiddleware&quot;,
    &quot;django.contrib.auth.middleware.AuthenticationMiddleware&quot;,
    &quot;league.middleware.Activity&quot;,
&lt;/code&gt;&lt;/pre&gt;

The next step is to build a context_processor that will include some useful stats like who is logged in etc., but that will need more models, a MySQL view, or a UserActivityManager that does a custom SQL query with some &quot;group by&quot; magic. I have not built those parts yet, so I won&#039;t speak about them yet.

&lt;strong&gt;My gripes about this implementation:&lt;/strong&gt; doing regular user activity stats is a relatively costly query (you need to do a SELECT COUNT(*) WHERE date&gt;now()-(20 minutes) GROUP BY user). This could be made cheaper by having a OneToOne join table against the user table, holding just an indexed recent_activity field that is touched on every request from that user. To get anonymous user activity we can only really rely on IP addresses, since sessions are not set until a user logs in/logs out, so we would need a similar table keyed on IP address, updated with the REPLACE syntax of MySQL (I am not sure if this is possible using Django).
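
The touch-table idea can be tried out with the sqlite3 module, which also understands REPLACE INTO. This is a sketch under assumptions and not something I have built: SQLite stands in for MySQL, and the table, column and function names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_activity ("
    " user_id INTEGER PRIMARY KEY,"
    " last_seen TEXT)"
)


def touch(user_id, when):
    """Insert or overwrite the single row kept for this user."""
    conn.execute(
        "REPLACE INTO recent_activity (user_id, last_seen) VALUES (?, ?)",
        (user_id, when),
    )


def active_between(start, end):
    """Return the ids of users last seen inside the given window."""
    rows = conn.execute(
        "SELECT user_id FROM recent_activity "
        "WHERE last_seen BETWEEN ? AND ? ORDER BY user_id",
        (start, end),
    )
    return [row[0] for row in rows]


touch(1, "2007-07-19 12:00:00")
touch(2, "2007-07-19 11:30:00")
touch(1, "2007-07-19 12:10:00")  # overwrites the earlier row for user 1

active = active_between("2007-07-19 11:55:00", "2007-07-19 12:15:00")
total_rows = conn.execute("SELECT COUNT(*) FROM recent_activity").fetchone()[0]
print(active, total_rows)
```

Because the table holds one row per user, the "who was active recently" query no longer needs a GROUP BY over the whole activity log. The same shape works for anonymous visitors if the key is the IP address instead of the user id.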

&lt;strong&gt;My gripe about the session middleware&lt;/strong&gt; is that users do not get sessions until they log in/log out. This is good because one-off visitors etc. do not get sent a cookie, and you don&#039;t allocate them a session in the DB, but it means unique sessions are more difficult to track because anonymous, first-time visitors are only unique by their IP address, and nothing else. I can obviously change this by setting any session variable for visitors without a session in the process_request of the activity middleware. This is neat because it is an opt-in DB hit, but after wrestling for ages with the session middleware, appreciating opt-in is something to be done in the sober light of day.

&lt;strong&gt;My gripes about Django&#039;s ORM&lt;/strong&gt; are that there is no neat way to do custom SQL queries (the nicest GROUP BY snippet I have seen is &lt;a href=&quot;http://www.djangosnippets.org/snippets/1/&quot;&gt;this one&lt;/a&gt; because it uses Django&#039;s _meta to get the table names). Newer changes in Django introduced the &lt;a href=&quot;http://www.djangoproject.com/documentation/db-api/#extra-select-none-where-none-params-none-tables-none&quot;&gt;extra&lt;/a&gt; parameter, which means less completely custom SQL (i.e. you can just append your customisations to the existing SQL statement), but it still doesn&#039;t allow you to use very specific stuff like GROUP BY (which not all DBs support). One way to remedy this is to figure out how to still send sanitised SQL to a DB server in an extra statement, while allowing more appended customisations for developers. The alternative is to build group_by functions which either translate to DB-specific queries, or do it virtually (much like the transactions infrastructure). I prefer the latter solution because I think GROUP BY is very relevant and very useful, but it does mean that if your DB doesn&#039;t support it, then it could be a very costly operation in python-space.</description>
 <comments>http://whijo.net/blog/brad/2007/07/19/statistics-logging-django.html#comments</comments>
 <category domain="http://whijo.net/tags/development">development</category>
 <category domain="http://whijo.net/geek-tags/django">django</category>
 <category domain="http://whijo.net/tags/geek">geek</category>
 <category domain="http://whijo.net/geek-tags/logging">logging</category>
 <category domain="http://whijo.net/geek-tags/middleware">middleware</category>
 <category domain="http://whijo.net/geek-tags/python">python</category>
 <category domain="http://whijo.net/geek-tags/statistics">statistics</category>
 <enclosure url="http://whijo.net/files/middleware.py_.txt" length="825" type="text/plain" />
 <pubDate>Thu, 19 Jul 2007 12:05:13 +0200</pubDate>
 <dc:creator>brad</dc:creator>
 <guid isPermaLink="false">108 at http://whijo.net</guid>
</item>
</channel>
</rss>
