Tomas Corcoran wrote-

>I wonder whether anyone is willing to put their neck on the line with this
>apparently innocuous question.
>
>It relates to the ongoing discussion regarding whether the mean,
>geometric mean or median should be used.
>
>When measuring neutrophil respiratory burst or phagocytic activity, using
>Phagoburst from pharmigen, should you use the absolute difference between
>geometric means (means, medians) of the control versus the activated
>samples, or should you use the ratios. We were mostly using the
>differences or ratios of the geometric means but there is a fierce debate
>ongoing in our laboratory regarding this issue and preliminary results
>from comparative studies show the use of ratios of geometric means coming
>out on top.

David Coder, in a recent posting worth reading, described the geometric mean as the 'correct' statistic for lognormally distributed data. I don't know that I would be that charitable; the software gives you the geometric mean because it is easy to compute, i.e., it doesn't require transformation of the data between log and linear space, while computing the real mean would.

If you have linear data, you get the mean by taking the sum of (channel number * number of events in channel) over the range and dividing by the total number of events. Since, in logarithmic space, addition and subtraction correspond to multiplication and division, performing the same operation on log scale data gives you a quantity which is the log of the nth root of the product of all the n data points; that nth root is the geometric mean. It is, as I say, simple to compute; taking the arithmetic mean, which is the mean we're all used to, would require transforming all the logarithmic values into linear ones, which tends to be inaccurate when log data are on a 256-channel or 1024-channel scale.
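To make the difference between the two means concrete, here is a minimal Python sketch; it is an illustration, not what any particular analysis package does. It assumes a 4-decade log amplifier spread over 1024 channels, with linear values running from 1 to 10,000, so that linear = 10^(channel * 4 / 1024). The function names and the two-event toy histogram are made up for the example.

```python
import math

DECADES = 4       # assumption: 4-decade log amplifier
CHANNELS = 1024   # assumption: 1024-channel histogram


def channel_to_linear(channel):
    """Convert a log-scale channel number to a linear value (1 to 10,000)."""
    return 10 ** (channel * DECADES / CHANNELS)


def geometric_mean_from_histogram(counts):
    """Average the channel numbers (log space), then convert once to linear.

    counts maps channel number -> number of events in that channel.
    """
    total = sum(counts.values())
    mean_channel = sum(ch * n for ch, n in counts.items()) / total
    return channel_to_linear(mean_channel)


def arithmetic_mean_from_histogram(counts):
    """Convert every channel to linear first, then average."""
    total = sum(counts.values())
    return sum(channel_to_linear(ch) * n for ch, n in counts.items()) / total


# Toy histogram (hypothetical): one event at linear 10, one at linear 1000.
histogram = {256: 1, 768: 1}

print(geometric_mean_from_histogram(histogram))   # 100.0
print(arithmetic_mean_from_histogram(histogram))  # 505.0
```

The geometric mean needs only one conversion at the end, which is why it is cheap to compute from log data; the arithmetic mean needs a conversion for every occupied channel, and the two can differ a great deal when the distribution is broad.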
If you are actually trying to compare flow data with a bulk assay of some kind - for example, you have determined the total amount of fluorescent label in a solution of 100,000 cells, and you now want to calibrate the flow cytometric fluorescence histogram in terms of molecules of label per channel - you do need to use the arithmetic mean, as Alice Givan recently pointed out, and you therefore need linear data, while you usually have log data. That is why most people trying to do quantitative immunofluorescence use beads bearing known amounts of label or antibody instead of trying to get back to fluorescence measurements in solution (which are, of course, required to calibrate the beads, but that's the bead manufacturers' headache, not the end users').

In respiratory burst or phagocytosis assays, what you want to know is how much more activity than control cells is to be found in your treated populations. You really need to subtract the linear values to get this number, but you don't need the means; the median is a better statistic here, because it is much less subject to the influence of outliers than is the mean.

TAKING THE RATIO OF TWO QUANTITIES ON A LOG SCALE IS SO STUPID THAT ANYONE INCLUDING SUCH A CALCULATION IN A PAPER SUBMITTED TO A JOURNAL SHOULD BE BANNED FOR A YEAR FROM SUBMITTING ANOTHER PAPER - but, lucky for so many people, most of the reviewers and editors, even those associated with some really toney journals, are blissfully unaware just how stupid it is. I assume people do this because it seems easy; they figure that, if they don't have linear data, and can't readily subtract the control value from the experimental one, they can take the ratio of the experimental and control values, and effectively make the statement that the treated cells exhibit x times as much activity as the controls, rather than q units more activity than the controls. That aim in itself is not illegitimate.
However, when you're working with data on a log scale, you get the ratio of activities by subtraction, not by division, since log (a/b) = log a - log b. If, instead, you take the ratio of log a and log b (log a/log b), what you get is the log of the (log b)th root of a, which is not remotely what you're looking for. So, if you want to express ratios of activities of treated and control cells, what you should do is subtract the (logarithmic) channel value of the median of the controls from the (logarithmic) channel value of the median of the treated cells; that difference is the log of the ratio. Or, convert the median channel values to linear and take the ratio.

[A digression: If you need a ratio on a cell by cell basis, e.g., for a pH, calcium, or membrane potential measurement, you may need to add a constant to subtracted values to keep data positive so the computed distribution fits on the histogram. See Novo et al, Cytometry 35:55-63, 1999.]

There is, however, another fundamental problem here. Let's go to our familiar 4-decade log scale; I'll assume the linear values run from 1 to 10,000. Now, consider a control distribution with a median at the halfway point of the bottom decade (the linear value is the square root of ten, rounded to 3) and a treated cell distribution with a median at the halfway point of the top decade (the linear value is 3,162). The ratio of treated median to control median is 1,000. However, we are dividing by a very small number, and a slight shift in the control median (likely to occur, since the bottom decade is much more likely to be affected by noise than the higher decades) can produce a very large shift in the ratio. If, instead, we subtract the linear value of the control median from the linear value of the treated cell median, we hardly notice the control at all - the original treated cell value is 3,162; the corrected value is 3,159, essentially identical (especially with the high CVs of biological data).
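The arithmetic in this example can be checked with a short Python sketch. It rests on the same assumptions as the example (a 4-decade log scale, linear values from 1 to 10,000) plus one more that is mine: a 1024-channel histogram, so the midpoints of the bottom and top decades fall at channels 128 and 896. The function names, and the doubling of the control median used to stand in for noise, are hypothetical.

```python
import math

DECADES = 4       # assumption: 4-decade log scale, as in the example
CHANNELS = 1024   # assumption: 1024-channel histogram


def channel_to_linear(channel):
    """Convert a log-scale channel number to a linear value (1 to 10,000)."""
    return 10 ** (channel * DECADES / CHANNELS)


def ratio_from_log_channels(treated_channel, control_channel):
    """log a - log b = log(a/b): subtract channels, then convert to linear."""
    return channel_to_linear(treated_channel - control_channel)


def relative_change(before, after):
    """Fractional change of a value, relative to its starting point."""
    return abs(after - before) / before


control_channel = 128   # midpoint of the bottom decade, linear about 3.16
treated_channel = 896   # midpoint of the top decade, linear about 3,162

# Right: subtract on the log scale. Wrong: divide one channel by the other.
right_ratio = ratio_from_log_channels(treated_channel, control_channel)
wrong_ratio = treated_channel / control_channel

control = channel_to_linear(control_channel)
treated = channel_to_linear(treated_channel)

# Hypothetical noise: the control median doubles (about 0.3 decade).
noisy_control = 2 * control

ratio_shift = relative_change(treated / control, treated / noisy_control)
difference_shift = relative_change(treated - control, treated - noisy_control)

print(f"ratio by log subtraction: {right_ratio:.0f}")
print(f"channel / channel:        {wrong_ratio:.1f}")
print(f"change in ratio:          {ratio_shift:.1%}")
print(f"change in difference:     {difference_shift:.2%}")
```

With these numbers the ratio falls by half when the control median doubles, while the difference moves by about a tenth of a per cent. The channel-over-channel quotient comes out as 7, which says nothing about relative activity and would change again with a different number of channels or decades.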
What we are after here is statistical robustness, i.e., measures which are minimally susceptible to the effects of outliers, roundoff, and small changes in experimental conditions. Medians are well known to be robust, and the difference between medians will be more robust than the ratio, particularly when the ratio becomes very large.

If you feel there is an overwhelming argument in favor of using a ratio, use one, but DON'T MAKE THE MISTAKE OF TAKING THE RATIO OF TWO VALUES ON A LOGARITHMIC SCALE! I'd much prefer to use a difference; furthermore, if the control values don't wander all over the lot, and there is at least a 1 1/2 log difference between the treated and control medians, you might as well not even correct by subtraction; you won't change the raw value by more than a few per cent if you subtract.

-Howard