Confusing Stats Terms Explained: Standard Deviation
With this in mind, I decided to compile a list of the most confusing stats terms and describe them in plain English, to clear up some of the confusion that surrounds them. Initially, this was intended to be a single blog post, but I soon realized that far too many words are required to adequately explain the whole list in one entry, so I’ve decided to present the terms over a series of entries. I hope this will allow me to offer thorough explanations and examples.
Standard deviation is a descriptive statistic that is used to understand the distribution of a dataset. It is often reported in combination with the mean (or average), giving context to that statistic. Specifically, a standard deviation describes how much the scores in a dataset tend to spread out from the mean.
A small standard deviation (relative to the mean score) indicates that the majority of individuals (or data points) tend to have scores that are very close to the mean (see figure below). In this case, the scores look clustered around the mean, with only a few scores farther away from it (probably outliers).
By contrast, a sample with a large standard deviation (relative to the mean score) tends to have cases that are more widely spread out from the mean (see figure on right), perhaps with only a few cases actually having scores that fall close to the mean.
You may be wondering to yourself: “Why should I care about the standard deviation?” The answer to that question is context. To really understand the basic characteristics of a dataset, you must put your statistics in context.
Allow me to demonstrate:
Imagine we have two samples of chocolate cake eaters, each with 10 people who self-report how many pieces of chocolate cake they've eaten in the last seven days.
- In dataset #1, we have five people who report eating 4 pieces of cake and five people who report eating 6 pieces of cake, for a mean of 5 pieces of cake
- (4+4+4+4+4+6+6+6+6+6)/10 = 5
- Mean (Average) = 5
- In dataset #2, we have five people who report eating 0 pieces of cake and five people who report eating 10 pieces of cake, for a mean of 5 pieces of cake
- (0+0+0+0+0+10+10+10+10+10)/10 = 5
- Mean (Average) = 5
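If you would like to check the two means yourself, here is a minimal Python sketch. The variable names (dataset_1, dataset_2, mean_1, mean_2) are my own, chosen only for illustration; the numbers are the cake counts listed above.

```python
from statistics import mean

# Pieces of chocolate cake eaten in the last seven days (10 people per sample)
dataset_1 = [4, 4, 4, 4, 4, 6, 6, 6, 6, 6]
dataset_2 = [0, 0, 0, 0, 0, 10, 10, 10, 10, 10]

mean_1 = mean(dataset_1)
mean_2 = mean(dataset_2)

# Both samples come out with a mean of 5 pieces of cake
print(mean_1, mean_2)
```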
In this case the two datasets have exactly the same mean, but that shared mean is somewhat deceptive: in dataset #1 everyone eats close to 5 pieces of cake, while in dataset #2 nobody does. In fact, the mean statistic can be a deceptive little bugger in general, when it is not presented in context. That is where a standard deviation comes in!
Now, you might be thinking: “Why not just look at the raw data and come to that conclusion? After all, you just came to that conclusion without ever talking about the standard deviation!”
Well, that is fine as long as you only have ten people in each sample AND as long as your sample is so neatly, cleanly, and clearly organized into moderate values and extreme values, as it is here. If that is the case, then you likely can get a perfectly firm grasp on your data without ever knowing the standard deviation! Unfortunately, data is rarely that clear, and sample sizes can be in the hundreds, thousands, or even millions, making it impossible to "eyeball" the data and draw reliable conclusions.
When these instances arise (which will be almost every time you work with data), your friendly standard deviation can give you the context you need. Let's consider the standard deviations of our chocolate cake datasets. Knowing that larger values of standard deviation are indicative of more points "spread" away from the mean, compared to smaller standard deviation values (as discussed in our first paragraph), which sample (#1 or #2) would you expect to have a larger standard deviation?...
(I'll pause while you ponder your answer and then answer out loud at a volume just loud enough to make the person sitting next to you wonder if you are stable and if it is safe to sit next to you...)
To calculate the standard deviation for your data, well, let's face it, all you need to do is use SPSS (or any other statistical software package). In SPSS, you can obtain the standard deviation by:
- In the menu bar, go to: Analyze -> Descriptive Statistics -> Descriptives -> move the variable you want to analyze from the list of variables in the left box to the empty box on the right -> click Options -> check the Std. deviation box, if it isn't already [it should be by default] -> click Continue -> click OK.
If you would rather calculate the standard deviation by hand, the steps are:
- Subtract your mean score from every person's actual (observed) score
- Square those difference scores for each person
- Add those squared values together for the whole sample
- Divide that sum by the number of cases in your data (10 in our case)
- Finally, calculate the square root of the number calculated in the previous step
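The hand calculation above can be sketched in Python, one line per step. This is only an illustration: the function name population_sd and the example scores are hypothetical (they are not the cake data), and the function divides by the number of cases exactly as the steps describe.

```python
import math

def population_sd(scores):
    """Standard deviation computed by dividing by the number of cases."""
    mean = sum(scores) / len(scores)
    # Subtract the mean from every observed score
    differences = [x - mean for x in scores]
    # Square each difference score
    squared = [d ** 2 for d in differences]
    # Add the squared differences together for the whole sample
    total = sum(squared)
    # Divide that sum by the number of cases
    variance = total / len(scores)
    # Take the square root of the result
    return math.sqrt(variance)

# Hypothetical scores, used only to exercise the function
example_scores = [2, 4, 4, 4, 5, 5, 7, 9]
example_sd = population_sd(example_scores)
print(example_sd)  # 2.0
```

For these made-up scores the mean is 5, the squared differences sum to 32, and 32 / 8 = 4, whose square root is 2.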
- In dataset #1, we have five people who report eating 4 pieces of cake and five people who report eating 6 pieces of cake, for a mean of 5 pieces of cake ([4+4+4+4+4+6+6+6+6+6]/10=5).
- Mean = 5; Standard Deviation = 1
- In dataset #2, we have five people who report eating 0 pieces of cake and five people who report eating 10 pieces of cake, for a mean of 5 pieces of cake ([0+0+0+0+0+10+10+10+10+10]/10=5).
- Mean = 5; Standard Deviation = 5
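One caveat worth checking in code: the values of 1 and 5 come from dividing by the number of cases (10). Many software packages, and as far as I know SPSS's Descriptives procedure among them, divide by the number of cases minus one by default, so their output for these samples will be slightly larger. The sketch below uses Python's standard statistics module to show both versions; the variable names are my own.

```python
from statistics import pstdev, stdev

dataset_1 = [4, 4, 4, 4, 4, 6, 6, 6, 6, 6]
dataset_2 = [0, 0, 0, 0, 0, 10, 10, 10, 10, 10]

# Dividing by N (matches the values reported above)
pop_sd_1 = pstdev(dataset_1)   # 1.0
pop_sd_2 = pstdev(dataset_2)   # 5.0

# Dividing by N - 1 (the usual default for sample data)
sample_sd_1 = stdev(dataset_1)  # about 1.05
sample_sd_2 = stdev(dataset_2)  # about 5.27

print(pop_sd_1, pop_sd_2)
print(round(sample_sd_1, 2), round(sample_sd_2, 2))
```

Either way, the conclusion is the same: dataset #2 is roughly five times as spread out as dataset #1.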
From this example, we can see that the standard deviation is critical to understanding your data, because it puts your mean statistic in context. In this case, it indicates that the mean for dataset #2 is not a very meaningful or useful statistic for understanding the eating tendencies of individuals in that dataset.
A few closing notes about standard deviations:
- A dataset's variance can be calculated by simply squaring the standard deviation.
- Variance = (standard deviation)²
- Variance will be the topic of a future "Confusing Stats Terms" blog, so you'll see then why we care about both for different reasons...
- One standard deviation above and below the mean is expected to include about 68% of the values in your dataset (assuming your distribution is normal).
- Two standard deviations above and below the mean would be expected to include about 95% of the values in your dataset (assuming your distribution is normal).
- Three standard deviations above and below the mean would be expected to include about 99.7% of the values in your dataset (assuming your distribution is normal).
- As alluded to earlier, standard deviations should only be calculated for interval (or ratio) data (also true for a mean score).
- Interval data is numeric data in which the distance between adjacent values is consistent (for example, the step from 1 to 2 represents the same increase as the step from 2 to 3 or from 3 to 4, etc.).
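The 68% / 95% / 99.7% figures in the notes above can be reproduced from the normal distribution itself. The following is a small sketch rather than anything you would need in practice; the helper name normal_coverage is my own, and it relies on the standard result that the share of a normal distribution within k standard deviations of the mean equals erf(k / √2).

```python
import math

def normal_coverage(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

coverage = {k: normal_coverage(k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} SD of the mean: {share:.1%}")
```

This prints roughly 68.3%, 95.4%, and 99.7%. Keep in mind that these shares only apply when the data are approximately normal; the two cake samples, for example, are not.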
Editorial Note: Stats Make Me Cry is owned and operated by Jeremy J. Taylor. The site offers many free statistical resources (e.g. a blog, SPSS video tutorials, R video tutorials, and a discussion forum), as well as fee-based statistical consulting and dissertation consulting services to individuals from a variety of disciplines all over the world.
Reader Comments (53)
I'm trying to analyze data from the call center industry that was given to me by a friend for practice purposes. My purpose is to identify key strengths and weaknesses. To keep it simple, I want to start with 2 variables and find out whether they affect each other, or should I say, are correlated. So my first step is to compare Handle Time and Call Volume. My goal is to identify whether customer support tends to have a high Handle Time (the time that customer support spends on the phone) when there are many calls. I was thinking that if a customer support agent knows that there are many incoming calls, then perhaps he would spend a long time with one customer to avoid taking another call, something like that. Below are the results:
Descriptive Statistics
Variable | N | Range | Minimum | Maximum | Sum | Mean | Mean (Std. Error) | Std. Deviation | Variance | Skewness | Skewness (Std. Error) | Kurtosis | Kurtosis (Std. Error)
handle time | 21 | 3213.00 | 45.00 | 3258.00 | 14319.00 | 681.8571 | 138.94402 | 636.72147 | 405414.229 | 3.540 | .501 | 14.729 | .972
call volume | 21 | 42.00 | 21.00 | 63.00 | 606.00 | 28.8571 | 2.00272 | 9.17761 | 84.229 | 3.058 | .501 | 10.147 | .972
Valid N (listwise) | 21
summary:
handle time
std dev=636.72147
mean=681.8571
call volume
std dev=9.17761
mean=28.8571
Well, I got lost. I do see that the Std. Deviation for handle time is big, but how big is big, and how small is small? Any insights on how I can interpret these results and come up with an answer to my hypothesis?
I'm not sure I fully understand your question, Rin. If you want to examine the association between two variables, you need to run a bivariate correlation analysis, not just examine the mean and SD, as those are not going to demonstrate association. What statistical platform are you using?
Awesome job! Thanks so so much!