sql - Distribution of table in time


I have a MySQL table with approximately 3000 rows per user. One of the columns is a datetime field, which is mutable, so the rows aren't in chronological order.

I'd like to visualize the time distribution in a chart, so I need a number of individual datapoints. 20 datapoints would be enough.

I could do this:

select timefield from entries where uid = ? order by timefield;

and look at every 150th row.

Or I could do 20 separate queries and use limit 1 and offset.

But there must be a more efficient solution...

  • Translate

    Michal Sznajder almost had it, but you can't use column aliases in a WHERE clause in SQL. So you have to wrap it as a derived table. I tried this and it returns 20 rows:

        SELECT @rownum:=@rownum+1 AS rownum, e.*
        FROM (SELECT @rownum := 0) r, entries e) AS e2
    WHERE uid = ? AND rownum % 150 = 0;

  • Translate

    Something like this came to my mind

    select @rownum:=@rownum+1 rownum, entries.* 
    from (select @rownum:=0) r, entries 
    where uid = ? and rownum % 150 = 0

    I don't have MySQL at my hand but maybe this will help ...

  • Translate

    As far as visualization, I know this is not the periodic sampling you are talking about, but I would look at all the rows for a user and choose an interval bucket, SUM within the buckets and show on a bar graph or similar. This would show a real "distribution", since many occurrences within a time frame may be significant.

    SELECT DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket -- choose an appropriate granularity (days used here)
    FROM entries
    WHERE uid = ?
    GROUP BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
    ORDER BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)

    Or if you don't like the way you have to repeat yourself - or if you are playing with different buckets and want to analyze across many users in 3-D (measure in Z against x, y uid, bucket):

    SELECT uid
        ,COUNT(*) AS measure
    FROM (
        SELECT uid
            ,DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket
        FROM entries
    ) AS buckets
    GROUP BY uid
    ORDER BY uid

    If I wanted to plot in 3-D, I would probably determine a way to order users according to some meaningful overall metric for the user.

  • Translate


    For whatever reason, your example only works when the where @recnum uses a less than operator. I think when the where filters out a row, the rownum doesn't get incremented, and it can't match anything else.

    If the original table has an auto incremented id column, and rows were inserted in chronological order, then this should work:

    select timefield from entries
    where uid = ? and id % 150 = 0 order by timefield;

    Of course that doesn't work if there is no correlation between the id and the timefield, unless you don't actually care about getting evenly spaced timefields, just 20 random ones.

  • Translate

    Do you really care about the individual data points? Or will using the statistical aggregate functions on the day number instead suffice to tell you what you wish to know?

  • Translate
    select timefield
    from entries
    where rand() = .01 --will return 1% of rows adjust as needed.

    Not a mysql expert so I'm not sure how rand() operates in this environment.

  • Translate

    For my reference - and for those using postgres - Postgres 9.4 will have ordered set aggregates that should solve this problem:

    SELECT percentile_disc(0.95) 
    WITHIN GROUP (ORDER BY response_time) 
    FROM pageviews;

    Source: http://www.craigkerstiens.com/2014/02/02/Examining-PostgreSQL-9.4/