Re: Ratio or Mean

From: Howard Shapiro (hms@shapirolab.com)
Date: Mon Mar 26 2001 - 19:04:24 EST


Tomas Corcoran wrote-

>I wonder whether anyone is willing to put their neck on the line with this
>apparently innocuous question.
>
>It relates to the ongoing discussion regarding whether the mean,
>geometric mean or median should be used.
>
>When measuring neutrophil respiratory burst or phagocytic activity, using
>Phagoburst from Pharmingen, should you use the absolute difference between
>geometric means (means, medians) of the control versus the activated
>samples, or should you use the ratios?  We were mostly using the
>differences or ratios of the geometric means but there is a fierce debate
>ongoing in our laboratory regarding this issue and preliminary results
>from comparative studies show the use of ratios of geometric means coming
>out on top.


David Coder, in a recent posting worth reading, described the geometric
mean as the 'correct' statistic for lognormal distributed data.  I don't
know that I would be that charitable; the software gives you the geometric
mean because it is easy to compute, i.e., it doesn't require transformation
of the data between log and linear space, while computing the real mean
would.  If you have linear data, you get the mean by taking the sum of
(channel number * number of events in channel) over the range and dividing by
the total number of events.  Since in logarithmic space, addition and
subtraction correspond to multiplication and division, performing the same
operation on log scale data gives you a quantity which is the log of the
nth root of the product of all the n data points; converting that one value
back to linear gives the geometric mean.  It is, as I say, simple to
compute; taking the arithmetic mean,
which is the mean we're all used to, would require transforming all the
logarithmic values into linear ones, which tends to be inaccurate when log
data are on a 256-channel or 1024-channel scale.
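
To make the distinction concrete, here is a minimal sketch in Python.  It assumes an idealized 4-decade, 1024-channel log scale on which channel 0 corresponds to a linear value of 1; the function names and the two-channel histogram are made up for illustration and do not come from any particular analysis package.

```python
CHANNELS = 1024   # assumed resolution of the log scale
DECADES = 4       # assumed range: linear values 1 to 10,000

def channel_to_linear(channel):
    """Map a log-scale channel number to a linear value (idealized log amp)."""
    return 10 ** (channel * DECADES / CHANNELS)

def geometric_mean(histogram):
    """Average the channel numbers, then convert that single value to linear."""
    total = sum(histogram.values())
    mean_channel = sum(ch * n for ch, n in histogram.items()) / total
    return channel_to_linear(mean_channel)

def arithmetic_mean(histogram):
    """Convert every channel to linear first, then average."""
    total = sum(histogram.values())
    return sum(channel_to_linear(ch) * n for ch, n in histogram.items()) / total

# Hypothetical histogram {channel: events}: half the cells at linear 10,
# half at linear 1,000.
hist = {256: 50, 768: 50}
print(geometric_mean(hist))   # about 100
print(arithmetic_mean(hist))  # about 505
```

The geometric mean needs one conversion at the end; the arithmetic mean needs one per channel, and each of those conversions carries the binning error of the log scale.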

If you are actually trying to compare flow data with a bulk assay of some
kind - for example, you have determined the total amount of fluorescent
label in a solution of 100,000 cells, and you now want to calibrate the
flow cytometric fluorescence histogram in terms of molecules of label per
channel - you do need to use the arithmetic mean, as Alice Givan recently
pointed out, and you therefore need linear data, while you usually have log
data.  That is why most people trying to do quantitative immunofluorescence
use beads bearing known amounts of label or antibody instead of trying to
get back to fluorescence measurements in solution (which are, of course,
required to calibrate the beads, but that's the bead manufacturers'
headache, not the end users').

In respiratory burst or phagocytosis assays, what you want to know is how
much more activity than control cells is to be found in your treated
populations.  You really need to subtract the linear values to get this
number, but you don't need the means; the median is a better statistic
here, because it is much less subject to the influence of outliers than is
the mean.
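
A quick illustration of the outlier point, using Python's standard statistics module.  The fluorescence values are invented for the example, not measured data.

```python
from statistics import mean, median

# Hypothetical linear fluorescence values for one sample
clean = [90, 95, 100, 105, 110]
# The same sample with two very bright events (say, aggregates) added
with_outliers = clean + [5000, 8000]

print(mean(clean), median(clean))                   # both are 100
print(mean(with_outliers), median(with_outliers))   # mean near 1929, median 105
```

Two stray events out of seven move the mean almost twentyfold, while the median moves from 100 to 105.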

TAKING THE RATIO OF TWO QUANTITIES ON A LOG SCALE IS SO STUPID THAT ANYONE
INCLUDING SUCH A CALCULATION IN A PAPER SUBMITTED TO A JOURNAL SHOULD BE
BANNED FOR A YEAR FROM SUBMITTING ANOTHER PAPER - but, lucky for so many
people, most of the reviewers and editors, even those associated with some
really toney journals, are blissfully unaware just how stupid it is.

I assume people do this because it seems easy; they figure that, if they
don't have linear data, and can't readily subtract the control value from
the experimental one, they can take the ratio of the experimental and
control values, and effectively make the statement that the treated cells
exhibit x times as much activity as the controls, rather than q units more
activity than the controls.  That aim in itself is not
illegitimate.  However, when you're working with data on a log scale, you
get the ratio of activities by subtraction, not by division, since
log (a/b) = log a - log b.  If, instead, you take the ratio of log a and
log b (log a/log b), what you get is the log of the (log b)th root of a
(that is, the logarithm of a to the base b), which is
not remotely what you're looking for.  So, if you want to express ratios of
activities of treated and control cells, what you should do is subtract the
(logarithmic) channel value of the median of the controls from the
(logarithmic) channel value of the median of the treated cells.  Or,
convert the median channel values to linear and take the ratio.
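
The two correct routes, and the incorrect one, can be sketched as follows.  This again assumes an idealized 4-decade, 1024-channel scale; the function names and the median channel values are hypothetical.

```python
CHANNELS = 1024                          # assumed resolution
DECADES = 4                              # assumed dynamic range
CHANNELS_PER_DECADE = CHANNELS / DECADES

def channel_to_linear(channel):
    """Convert a log-scale channel number to a linear value."""
    return 10 ** (channel / CHANNELS_PER_DECADE)

def ratio_from_channels(treated_channel, control_channel):
    """Subtract channels (log a - log b), then convert the difference to a ratio."""
    return 10 ** ((treated_channel - control_channel) / CHANNELS_PER_DECADE)

# Hypothetical median channels: linear values of 10 and 1,000
control, treated = 256, 768

print(ratio_from_channels(treated, control))                     # about 100
print(channel_to_linear(treated) / channel_to_linear(control))   # about 100
print(treated / control)   # 3.0, the "ratio of logs", which means nothing here
```

Subtracting channels and converting to linear before dividing agree with each other; dividing one channel number by the other gives 3, which says nothing about a hundredfold difference in activity.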

[A digression: If you need a ratio on a cell by cell basis, e.g., for a pH,
calcium, or membrane potential measurement, you may need to add a constant
to subtracted values to keep data positive so the computed distribution
fits on the histogram.  See Novo et al., Cytometry 35:55-63, 1999.]

There is, however, another fundamental problem here.  Let's go to our
familiar 4-decade log scale; I'll assume the linear values run from 1 to
10,000.  Now, consider a control distribution with a median at the halfway
point of the bottom decade (the linear value is the square root of ten,
rounded to 3) and a treated cell distribution with a median at the halfway
point of the top decade (the linear value is 3,162). The ratio of treated
median to control median is 1,000.  However, we are dividing by a very
small number, and a slight shift in the control median (likely to occur
since the bottom decade is much more likely to be affected by noise than
the higher decades) can produce a very large shift in the ratio.  If,
instead, we subtract the linear value of the control median from the linear
value of the treated cell median, the result hardly changes - the original
treated cell value is 3,162; the corrected value is 3,159, essentially
identical (especially with the high CVs of biological data).
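
Here is a small numerical sketch of that sensitivity.  The medians are the ones from the example above; the shifted control values of 2 and 5 are arbitrary choices standing in for noise in the bottom decade, and the function name is made up.

```python
control = 10 ** 0.5   # midpoint of the bottom decade, about 3.16
treated = 10 ** 3.5   # midpoint of the top decade, about 3,162

def compare(control_median, treated_median):
    """Return (ratio, difference) of treated to control, both in linear units."""
    return treated_median / control_median, treated_median - control_median

# Try the original control median and two slightly shifted ones
for shifted in (control, 2.0, 5.0):
    ratio, diff = compare(shifted, treated)
    print(f"control={shifted:.2f}  ratio={ratio:.0f}  difference={diff:.0f}")
```

With the control median at 3.16, 2, and 5 the ratio comes out near 1,000, 1,581, and 632, while the difference stays at roughly 3,159, 3,160, and 3,157.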

What we are after here is statistical robustness, i.e., measures which are
minimally susceptible to effects of outliers, roundoff, and small changes in
experimental conditions.  Medians are well known to be robust, and the
difference between medians will be more robust than the ratio, particularly
when the ratio becomes very large.  If you feel there is an overwhelming
argument in favor of using a ratio, use one, but DON'T MAKE THE MISTAKE OF
TAKING THE RATIO OF TWO VALUES ON A LOGARITHMIC SCALE!  I'd much prefer to
use a difference; furthermore, if the control values don't wander all over
the lot, and there is at least a 1 1/2 log difference between the treated
and control medians, you might as well not even correct by subtraction; you
won't change the raw value by more than a few per cent if you subtract.
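
The "few per cent" figure follows directly from the separation: if the medians are d decades apart, the control is 10^(-d) of the treated value.  A minimal sketch, with a made-up function name:

```python
def fraction_removed(log_separation):
    """Fraction of the treated median removed by subtracting the control median,
    given the separation between the two medians in decades."""
    return 10 ** (-log_separation)

for d in (0.5, 1.0, 1.5, 2.0):
    print(f"{d} decades apart: subtraction lowers the value by {fraction_removed(d):.1%}")
```

At 1.5 decades the correction is about 3.2%; at half a decade it is closer to 32%, so there the subtraction does matter.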

-Howard


