
    Monday, April 26, 2010

    Data Transformations: statistical voodoo or truth serum for your data?

    Anyone who has taken a statistics class has probably learned about transforming data at one time or another (although you may be in denial about it). In short, you may want to transform your data if you need to perform a parametric analysis, but the inherent assumptions are violated in your dataset. While this seems simple enough, many researchers are hesitant to employ this tactic for handling non-normally distributed data. Often, they can't quite put their finger on what bothers them about it, but the idea of "artificially changing their data" leaves many feeling uneasy.

    As a researcher myself, I can relate: many of the same people who preach the importance of considering analysis assumptions also teach us to fiercely protect the integrity of our data. However, I believe this may be where our methodological good intentions lead many of us astray. While protecting the integrity of our data is indeed of paramount importance, transforming variables for the purpose of reaching a normal distribution is unfairly characterized as a threat to that integrity. In fact, one could easily argue that making inferences from results that are biased by non-normally distributed variables is a greater threat to the integrity of your data and analysis than any transformation. It is perfectly reasonable for a well-intentioned researcher to worry about the consequences of transformation, and to be wary of a transformation that produces a significant result where one was not present before. That researcher should rest easier, however, knowing that an appropriately applied transformation generally raises the likelihood that their test of significance is UNBIASED, while it DOES NOT typically raise the likelihood of finding significance in the absence of a true relationship (Type I error). When no true relationship exists between two variables, a test is no more likely to find significance when its variables are transformed than when they are not.
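
    If you would rather not take that on faith, a quick simulation makes the point. The sketch below is in R, with arbitrary sample sizes and a log transformation chosen purely for illustration; both groups are drawn from the same skewed population, so the null hypothesis is true and any "significant" result is a false positive:

        # Both groups come from the same lognormal population, so any
        # significant t-test result is a Type I error.
        set.seed(42)
        n.sims <- 5000
        p.raw <- numeric(n.sims)
        p.log <- numeric(n.sims)
        for (i in 1:n.sims) {
          x <- rlnorm(30)   # skewed sample, group 1
          y <- rlnorm(30)   # skewed sample, group 2 (identical population)
          p.raw[i] <- t.test(x, y)$p.value
          p.log[i] <- t.test(log(x), log(y))$p.value
        }
        mean(p.raw < .05)   # false-positive rate, untransformed
        mean(p.log < .05)   # false-positive rate, log-transformed
        # Both proportions should sit near the nominal .05 level:
        # transformation does not manufacture significance under the null.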

    From a pragmatic perspective, transformation offers the benefit of allowing researchers to use the techniques they are familiar with and most likely to apply in their current situation, while minimizing the potential bias introduced by non-normally distributed data. In many cases, depending on the type of analysis being used and the design of your study, alternative and possibly more sophisticated techniques exist for dealing with non-normal data. However, many of these techniques require substantial statistical experience and may be intimidating to those seeking to deal with their assumption problems. While transformation is surely not the answer to all assumption problems, or even to all non-normal data, it is far from "voodoo" and is an attractive alternative to turning a blind eye to the distribution of the data.
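
    In practice, "transforming" usually just means applying a simple function (a log, square root, or reciprocal) and re-checking the distribution. A minimal R sketch, using simulated income-like data (all values made up for illustration):

        set.seed(1)
        income <- rlnorm(200, meanlog = 10, sdlog = 0.8)  # right-skewed, income-like

        shapiro.test(income)        # typically rejects normality for skewed data
        shapiro.test(log(income))   # logged values look far more normal

        # Visual check: Q-Q plots before and after the transformation
        par(mfrow = c(1, 2))
        qqnorm(income, main = "Raw data");    qqline(income)
        qqnorm(log(income), main = "Logged"); qqline(log(income))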

    Editorial Note: Stats Make Me Cry is owned and operated by Jeremy J. Taylor. The site offers many free statistical resources (e.g. a blog, SPSS video tutorials, R video tutorials, and a discussion forum), as well as fee-based statistical consulting and dissertation consulting services to individuals from a variety of disciplines all over the world.


    Reader Comments (11)

    In one of my exploratory analyses on a large data set on growth, the Kolmogorov-Smirnov test shows a significance level of .0000, but the Q-Q plots are almost coinciding. How do I interpret the normality of the data?

    July 31, 2012 | Unregistered Commenter venbreed

    Hey Venbreed,
    What do you mean when you say they are "almost coinciding"?

    August 27, 2012 | Registered Commenter Jeremy Taylor

    A couple of concrete examples can illustrate the arbitrariness of scales.

    First, there are seemingly equivalent measures that produce different scales, such as time and rate. If three people (or groups) take 1, 2, and 3 hours, respectively, to travel 100 kilometres, it would appear that they are equally spaced on some underlying scale. But if we convert to the seemingly equivalent rate, 100/1 = 100, 100/2 = 50, 100/3 = 33.3, we see that this measure produces a different scaling of the individuals (groups). For example, person (group) 2 is at the mean with the first measure, and below the mean with the second.
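
    The arithmetic is easy to verify; for instance, in R (using the numbers above):

        time <- c(1, 2, 3)    # hours to travel 100 km
        rate <- 100 / time    # km/h: 100, 50, and 33.3
        mean(time)            # 2    -- person (group) 2 sits exactly at the mean
        mean(rate)            # 61.1 -- at 50 km/h, person (group) 2 is now below it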

    Second, the underlying scale of many measures is obviously variable at some deeper level, including per capita income around the world. Surely no one would think that the psychological or economic impact of an increase in income from $1,000 to $2,000 is the same as the increase from $60,000 to $61,000, or from $1,000,000 to $1,001,000. The measure almost demands some sort of transformation (or nonlinear analysis). The same logic applies to many other measures.
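
    A log transformation makes that intuition concrete: on the log scale, the same $1,000 increase shrinks as the base income grows. Again in R, purely for illustration:

        diff(log10(c(1000, 2000)))          # 0.30103 -- income doubles
        diff(log10(c(60000, 61000)))        # 0.00718
        diff(log10(c(1000000, 1001000)))    # 0.00043 -- barely registers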

    September 3, 2012 | Unregistered Commenter Jim Clark

    VERY well said, Jim, and a wonderful example! Thank you!

    September 3, 2012 | Registered Commenter Jeremy Taylor

    Here is the question that has me confused on my current assignment. I am supposed to type the appropriate summaries or statistics into each cell (by "inferential statistics" he means the test used to obtain a p-value, not the p-value itself). He lists the levels of measurement, starting with univariate: nominal, ordinal, and interval/ratio. For each of these we are to give the summary, the descriptive statistic, and the inferential statistic. Then bivariate: nominal and nominal, nominal and ordinal, nominal and interval/ratio, ordinal and ordinal, ordinal and interval/ratio, and interval/ratio and interval/ratio, again giving the summary, descriptive statistic, and inferential statistic for each. Can someone explain to me what he is asking?

    October 25, 2012 | Unregistered Commenter Michael Kahler

    Dear Michael,

    It sounds to me like your professor is asking you to identify the appropriate type of analysis and/or statistic that goes with each type of data he presents. Nominal, ordinal, interval, and ratio are all different levels of measurement, so different types of statistics are used to describe and test them. I hope this helps.
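
    As a rough sketch of that mapping, here is some R with entirely made-up variables (your course materials may prescribe different tests, so treat this only as an illustration):

        # Hypothetical data: one nominal, one ordinal, one interval/ratio variable
        set.seed(123)
        df <- data.frame(group  = factor(sample(c("A", "B"), 100, replace = TRUE)),
                         rating = sample(1:5, 100, replace = TRUE),   # ordinal
                         score  = rnorm(100))                         # interval/ratio

        chisq.test(table(df$group, df$rating > 3))          # nominal x nominal
        t.test(score ~ group, data = df)                    # nominal x interval/ratio
        cor.test(df$rating, df$score, method = "spearman")  # ordinal x interval/ratio
        cor.test(df$score, rnorm(100))                      # interval/ratio x interval/ratio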

    November 3, 2012 | Registered Commenter Jeremy Taylor

    In a linear regression model, what is the name given to the unaccounted-for values?

    January 18, 2013 | Unregistered Commenter Beauty

    Good question. The answer is residuals.
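
    In R, for example, they are literally what residuals() returns after you fit a model; a minimal sketch with made-up data:

        set.seed(7)
        x <- rnorm(50)
        y <- 2 * x + rnorm(50)   # true slope of 2, plus noise
        fit <- lm(y ~ x)
        head(residuals(fit))     # the part of each y the model leaves unaccounted for
        # By definition, residual = observed - fitted:
        all.equal(unname(residuals(fit)), unname(y - fitted(fit)))   # TRUE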

    January 29, 2013 | Registered Commenter Jeremy Taylor

    I'm confused about one thing. The assumption of normality states that we assume the sampling distributions to be normal. We do not assume that populations or samples are normal. So how does transforming our sample to be normal fix anything about a non-normal sampling distribution? Thanks so much!

    March 2, 2013 | Unregistered Commenter camilo

    Most parametric tests do actually assume that the variable (or, more precisely, the model's residuals) is normally distributed in the population; normality of the sampling distribution then follows from that, or is approximated in large samples via the central limit theorem. Since the sample is our only window into the population, a badly non-normal sample casts doubt on that population-level assumption, and an appropriate transformation is meant to put the analysis back on solid footing. If for some reason you believe the variable genuinely would not be normal in the actual population, then it should not be transformed; instead, we should use non-parametric methods. I hope that helps!

    March 26, 2013 | Registered Commenter Jeremy Taylor

    Hi Jeremy, thanks for this blog, it is a great help. My question is: how do you decide to transform your data rather than use a non-parametric test? Specifically, t-tests vs. the Wilcoxon rank-sum test come to mind.

    August 21, 2013 | Unregistered Commenter Mike
