Friday 3 December 2010

Basic Intro to Statistics for Analysts

Don’t panic.

Before I was lucky enough to complete the National Intelligence Analyst Training Course at Greater Manchester Police’s excellent Sedgley Park training facility, I didn’t know my statistics from my elbow. (I fully recommend taking at least the Crime Pattern Analysis course, by the way.)

I hadn’t done any proper math in years, and what little statistics I was exposed to in my university degree went completely over my head.

During this course I learned 2 important things.

1.     Statistics doesn’t have to be complicated. Once you understand the very basics, you’re pretty much up to speed with what you need to know day to day as a crime analyst.

2.     Statistics can really help you to tell a worrying trend from a natural “blip”. We all know that crime levels go up and down each month. There is a level of natural variation that would occur even if we did everything this month exactly the same as we did last month. Statistics allows you to tell which increases mean that something important has changed.

What I want to do in this blog is to introduce a very simple statistical tool that I use regularly. Hopefully this will convince those of you who don’t currently use any statistical analysis that there is merit in getting to grips with it a little bit.

I responded to a query from a colleague in the IACA recently by telling her about the technique, and got lots of positive feedback from other analysts, so I hope it can be of help to you also.

Hypothetical Situation

So you are in your office one day and the divisional commander sticks their head through the door. You didn’t see them coming early enough, so you haven’t had a chance to hide under the desk.

“We’re taking a look at burglary across the division, and we need to know if any of our areas are showing different trends to the division and need attention” they say, before strolling off.



Put really simply, as analysts we know that we need to plot the number of burglaries for each area across time, and see if the lines are behaving in the same way. Simple!

The problem we have is that the lines might not be comparable on the same graph. One area may naturally suffer a higher number every month, or the local areas will have much smaller totals than the divisional figure, so the division’s line will be really high on the graph whereas the rest will be lower down. This will be even more so if we want to compare against regional and national trends. I’ve put together a simple example dataset below. The division is Oldham, and Werneth, Glodwick and Coldhurst are three towns within the division (the actual figures are made up).


If I just drop these onto a graph, I get:






This is technically right, but it’s also about as useful as a hole in a bucket.


But what’s that I hear you say? Excel lets you plot series on a secondary axis? By crikey I think you’re right!

As the figures for the towns are quite similar I could simply chart these figures and move the three towns over to a secondary axis. I could then play about with the scale for the secondary axis to get the lines placed somewhere similar on the chart and end up with something like this:


Much better! Or is it?


The problem with this is:

1.      I have manipulated the data to make it look like it does. Whilst this is simply for the purpose of making the lines easy to compare, it is hardly the "scientific method". It carries the risk of either misinterpreting the data because of the way it has been manipulated, or deliberately making it look a particular way to strengthen a particular argument. Oh dear!

2.     If I wanted to add another data series that was dissimilar to both these (for instance the regional or national rates) I wouldn’t be able to. I've run out of axes!

What we need is some way of changing these numbers so that they are all on the same “scale”. That way, we can just put them all on one axis on one graph.

The Statistical Bit

Ok, I hope you’re still there, and you’re about ready to do the statistical bit. The next bit isn’t complicated, but it’s important that the concept in the next few lines sinks in. Take a break if you need to. Have a cuppa.

The technique we are going to use is called “Z-Score” or “Standardised Score” (different names, same thing. Z-Score sounds cooler though, and is the more widely used term).

Z-Score allows you to chart different series of data with wildly different figures on the same chart using only one axis.

This is because, instead of showing the actual "counts", it shows how far each point in a series is from the average for that series.

Ok, that’s the important bit, so I’m going to say it in a different way in case you didn’t get it.

Here’s the data again:


Take October 2009 as an example. When we do a Z-score, what we will be doing is replacing these numbers with a number showing how far each number is from the average for that series. So instead of 1569 being the number in the Oldham column, it will be a number which represents the distance 1569 is from the average for all of Oldham’s results. And instead of 38 in Werneth, it will be a number which represents the distance 38 is from the average for all of Werneth’s results, and so on.

Now the number isn’t as simple as taking the average away from the figure. That wouldn’t help. The number for Oldham would still be different in scale from the numbers for the 3 towns. The measurement of distance that we use is something called the Standard Deviation.
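To see why simple subtraction doesn’t help, here is a rough sketch in Python. The counts are hypothetical figures I have made up purely for illustration (they are not the ones from the table), and `deviations_from_mean` is just a helper name I have picked.

```python
# Hypothetical monthly burglary counts, made up for illustration only.
oldham = [1500, 1569, 1620, 1480]
werneth = [35, 38, 41, 30]


def deviations_from_mean(series):
    """Return each value minus the average of its own series."""
    average = sum(series) / len(series)
    return [value - average for value in series]


oldham_dev = deviations_from_mean(oldham)
werneth_dev = deviations_from_mean(werneth)

print(oldham_dev)   # differences measured in tens of burglaries
print(werneth_dev)  # differences measured in single burglaries
```

Both lists are now centred on zero, but the division still swings by dozens while the town swings by a handful, so the two lines still wouldn’t sit comfortably on one axis.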

There are some very complicated descriptions of what a standard deviation is. The simplest description I can give you is that it is a number which shows you how spread out your numbers are.

Do you remember the phrases “normal distribution” and “bell shaped curve” from school? They probably took the heights of everyone in the class and plotted them on a graph. You end up with something that looks like this shape:



The Standard Deviation tells you how spread out this curve is. In this example, the average is 100, and the Standard Deviation is 15. In a normal distribution (a graph that looks this shape), about 68% of the results will be within +/- 1 standard deviation of the average (between 85 and 115 here), and about 95% of the results will be within +/- 2 standard deviations (between 70 and 130).
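If you want to check those percentages for yourself, here is a minimal sketch using Python’s built-in `statistics.NormalDist`. It assumes a perfectly normal curve with the average of 100 and Standard Deviation of 15 from the example; `share_within` is a helper name I made up.

```python
from statistics import NormalDist

# The bell curve from the example: average 100, standard deviation 15.
curve = NormalDist(mu=100, sigma=15)


def share_within(distribution, k):
    """Proportion of results within k standard deviations of the average."""
    lower = distribution.mean - k * distribution.stdev
    upper = distribution.mean + k * distribution.stdev
    return distribution.cdf(upper) - distribution.cdf(lower)


within_1 = share_within(curve, 1)
within_2 = share_within(curve, 2)

print(f"Within 1 standard deviation: {within_1:.1%}")
print(f"Within 2 standard deviations: {within_2:.1%}")
```

The results come out at roughly 68% and 95%, which is where the rule of thumb comes from. Real crime counts are rarely perfectly normal, so treat the percentages as a guide rather than a guarantee.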

What the Z-score does is tell you how many Standard Deviations each result is away from the average for that series.

This means that each result is presented on the same scale as every other result, including the results in the other series, because they are all measured in “standard deviations” (hence standardised score!)
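As a quick worked example, here is the idea as a tiny Python function, using the bell curve figures from above (average 100, Standard Deviation 15). The function name `z_score` is my own choice for illustration.

```python
def z_score(value, average, standard_deviation):
    """How many standard deviations a value sits above or below the average."""
    return (value - average) / standard_deviation


# A result of 130 is two standard deviations above the average of 100.
print(z_score(130, 100, 15))  # 2.0

# A result of 85 is one standard deviation below it.
print(z_score(85, 100, 15))   # -1.0
```

A positive score means the result is above average, a negative score means it is below, and zero means it is bang on average.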

The benefits of doing it this way are:

1.     You can put as many different types of data series on the graph as you like, instead of being limited to two

2.     Simply using a secondary axis can sometimes be misleading as the scales for the axis can be manipulated to move the lines up and down and to change their "depth", in relation to each other. Using a z-score graph makes all the data series consistent.


So, instead of using a secondary axis, I calculate the z-scores for each data series.

The first thing to do is calculate the Mean and Standard Deviation for each of my 4 series. The easiest way to do this is to use the AVERAGE and STDEV functions in Excel.

For each point, I then subtract the Mean and divide the result by the Standard Deviation.
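If you would rather do this outside Excel, the same steps can be sketched in Python. The counts below are hypothetical numbers I have made up for illustration (not the figures from my table), and I have only included two of the four series to keep it short. `statistics.stdev` is the sample standard deviation, which is what Excel’s STDEV calculates.

```python
from statistics import mean, stdev

# Hypothetical monthly burglary counts, made up for illustration only.
burglaries = {
    "Oldham": [1500, 1569, 1620, 1480, 1535, 1610],
    "Werneth": [35, 38, 41, 30, 33, 45],
}


def z_scores(series):
    """Subtract the mean from each point, then divide by the standard deviation."""
    average = mean(series)
    spread = stdev(series)  # sample standard deviation, like Excel's STDEV
    return [(value - average) / spread for value in series]


standardised = {area: z_scores(counts) for area, counts in burglaries.items()}

for area, scores in standardised.items():
    print(area, [round(score, 2) for score in scores])
```

Every series now has an average of 0 and a standard deviation of 1, which is exactly why they can all share one axis.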


Doing this for each point gives me:









Which, when I chart the Z-scores gives me:


This gives me a much more robust graph that clearly shows the relationships between the different data series using 1 axis. It’s also a much more scientific and auditable method for comparing the series.

Also, because the y axis gives me the standard deviations, I can draw out the points along each series where statistically significant counts have been recorded.

Remember when I said that in a normal distribution about 95% of the results will be within +/- 2 Standard Deviations? Well, that also means that only about 5% of the results will be outside these limits. Therefore, as long as the counts are roughly normally distributed, there is only about a 5% chance that natural variation alone will produce a result more than 2 standard deviations from the average. In crime analysis, this is enough of a level of significance (some scientific disciplines, such as medicine, may require a stricter 1% level of significance, meaning that results within roughly +/- 2.6 standard deviations are considered normal).

If we find a result that is outside these boundaries, then we say that it is “Statistically Significant”. As a rule of thumb, we should never be saying that we have had “significant” results unless we can show they are “statistically significant”. It’s poor scientific practice. If you can’t show it, use “exceptional”, “unusual”, “meaningful” or something else.

So I know from this graph that in June 2009 there was a significant level of offending in Glodwick (maybe we had a very active offender move into the area), and in January we had near-significant low levels of offending across all areas (maybe we ran an operation that month). I can also say that Werneth should be our priority, as it is heading towards recording significantly high levels of burglary.
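Picking out the significant points can be sketched in Python too. The months and z-scores below are hypothetical values I have made up for illustration, and `flag_significant` is just a name I have picked; the threshold of 2 is the +/- 2 standard deviations rule of thumb from above.

```python
def flag_significant(labels, scores, threshold=2.0):
    """Return (label, z-score) pairs that fall outside +/- threshold."""
    return [
        (label, score)
        for label, score in zip(labels, scores)
        if abs(score) > threshold
    ]


# Hypothetical z-scores for one area, made up for illustration only.
months = ["Apr", "May", "Jun", "Jul", "Aug", "Sep"]
example_scores = [0.4, -0.9, 2.3, 1.1, -1.9, -2.1]

flagged = flag_significant(months, example_scores)
print(flagged)  # [('Jun', 2.3), ('Sep', -2.1)]
```

Notice that August at -1.9 is close but not flagged. That is the sort of result I would describe as “near significant” and keep an eye on.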

Hopefully all this makes sense. If you can explain it more simply, share it here! (email me and I will post it).

Believe me when I tell you that it isn’t complicated. If you can get this technique to click in your head, then you have it licked! I find loads of uses for this technique. I compare local trends to national trends; I compare staffing levels to the number of crimes in each area to make sure we have the right number of officers in the right place; I compare trends in incidents to trends in crimes. There are loads of uses that I’m sure you can think of.

As always, please feel free to leave a comment or email me, and have an awesome weekend.