This was something I came across while trying to calculate expected waiting times for tubes and buses. I’ve collected several months’ worth of transport data and wanted to calculate the mean and variance of waiting times at every station and platform. This can be achieved in a few hours for the tube, but there are over 600 bus routes in London and many more stops, so I needed something more computationally efficient than the naive algorithm that I was using.

After looking around, I found the following recurrance formulas for mean and variance:

[latex]M_k=M_{k-1}+(x_k-M_{k-1})/k[/latex]

[latex]S_k=S_{k-1}+(x_k-M_{k-1})*(x_k=M_k)[/latex]

See: http://www.jstor.org/stable/2286154 for a comparison of different methods, but the original method dates back to a paper from 1962 by B. P. Welford published in Technometrics: http://www.jstor.org/stable/1266577. It’s also in Donald Knuth’s book, “The Art of Computer Programming, Volume 2: Semi-Numerical Algorithms”, so I probably should have read my own copy a bit more carefully. The section on “Numerical Precision” in floating point maths is essential reading for any kind of data-mining or mathematical modelling. Not just because of the mantissa size and “Very big number minus very small number equals no change” problem, but also because I want to use running mean and variance to build an adaptive system that can detect problems in the transport network as they happen.

At the moment, the real-time problem detection system for the tube uses statistics that I have pre-computed, so when a waiting time at a station exceeds what is normal, then it gets flagged on the map as a potential problem. With the bus data calculations being so computationally intensive, it makes more sense to use the running mean and variance formulas in an online system so that it adapts over time to what is considered to be the normal operating point of the system.