Mario Roederer commented on my recent tirade:

>>...the software gives you the geometric
>>mean because it is easy to compute, i.e., it doesn't require transformation
>>of the data between log and linear space, while computing the real mean
>>would.
>
>Most software doesn't give you geometric mean because it's "easy", it does
>so because it tends to be the most useful statistic.

I found the geometric mean mentioned in only two of seven statistics books I pulled off my shelf. Only one of the two alluded to its use for characterizing data transformed to a log scale. So, unless Mario provides references for the geometric mean being "the most useful statistic", I'll stick to my contention that it's widely available because it's easy to compute.

> The transformation itself is trivial and does not impact on programming
> demand, and it sure doesn't slow down today's computers noticeably. Any
> software package capable of software compensation clearly has the
> capability of computing arithmetic means...

All true; the problem is not computational speed but the relative imprecision of transforming data from log to linear and back when they have been captured on only a 256- or 1024-channel log scale. I believe the software packages capable of software compensation (the list would include FCS Express, FlowJo, WinList, and software from B-D (for their digital electronics), Beckman Coulter (for the XL), and Cytomation) are in the minority, but I applaud the trend to include software compensation in packages and I clap even harder for the trend toward high-resolution analog-to-digital conversion, without which software compensation doesn't work very well. Even more important, high-resolution conversion lets us keep accurate linear data, facilitating computation of arithmetic means, medians, ratios, percentiles, etc., etc. - if we had had it when the field got started, people would not have developed the bad habits I alluded to.
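To make the two points above concrete, here is a minimal sketch, not taken from any particular analysis package. It assumes a 4-decade log scale whose bottom channel corresponds to a linear value of 1, so that channel c on an n-channel scale maps to 10^(4c/n); the function names are hypothetical. It shows why the geometric mean comes almost for free from channel numbers, why the arithmetic mean needs every event converted to linear first, and how coarse each channel step is at 256 versus 1024 channels.

```python
def channel_to_linear(channel, n_channels=1024, decades=4.0):
    """Convert a log-scale channel number to a linear value.

    Assumption: channel 0 maps to 1 and channel n_channels maps to
    10**decades (a hypothetical 4-decade log amplifier).
    """
    return 10.0 ** (decades * channel / n_channels)


def step_ratio(n_channels, decades=4.0):
    """Factor by which the linear value grows from one channel to the next."""
    return 10.0 ** (decades / n_channels)


def geometric_mean(channels, n_channels=1024, decades=4.0):
    """Geometric mean: average the channel numbers, then convert once."""
    mean_channel = sum(channels) / len(channels)
    return channel_to_linear(mean_channel, n_channels, decades)


def arithmetic_mean(channels, n_channels=1024, decades=4.0):
    """Arithmetic mean: convert every event to linear, then average."""
    values = [channel_to_linear(c, n_channels, decades) for c in channels]
    return sum(values) / len(values)


# Two made-up events, one at linear value 100 and one at 10,000.
events = [512, 1024]
print(geometric_mean(events))   # about 1000
print(arithmetic_mean(events))  # about 5050

# Relative size of one channel step: roughly 3.7% at 256 channels,
# roughly 0.9% at 1024 channels.
print(step_ratio(256) - 1.0, step_ratio(1024) - 1.0)
```

Under these assumptions, any linear value recovered from a 256-channel log histogram is only known to within a few percent per channel, which is the imprecision referred to above.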
>>TAKING THE RATIO OF TWO QUANTITIES ON A LOG SCALE IS SO STUPID THAT ANYONE
>>INCLUDING SUCH A CALCULATION IN A PAPER SUBMITTED TO A JOURNAL SHOULD BE
>>BANNED FOR A YEAR FROM SUBMITTING ANOTHER PAPER - but, lucky for so many
>>people, most of the reviewers and editors, even those associated with some
>>really toney journals, are blissfully unaware just how stupid it is.
>
>It is important to clarify that what is stupid is ratioing the channel
>values of a log scale--i.e., ratioing values that increase linearly on a
>logarithmic fluorescence scale. There is nothing inherently wrong with
>ratioing the "scale" values (those that increase exponentially). For
>example, if a population has a median fluorescence of 10,000 (4th decade,
>channel 1024) and another has a median fluorescence of 1,000 (3rd decade,
>channel 768), then it is appropriate to note that it is 10 times as bright
>as the other (not 1.33 times).

I agree; that is essentially what I said later on.

>We used ratios when it was appropriate: for example, when we were
>measuring the fold-increase of beta-gal activity driven by a promoter
>after stimulation (measured by FACS-Gal assay). The pre-stimulation
>condition still expressed considerable beta-gal; when stimulated, we got
>5x or 10x as much. The ratio of the median fluorescences was appropriate
>because we found that the RATIO of the post- to pre-stimulation values was
>conserved across different cell lines, although they had different basal
>expression levels (and therefore different stimulated levels). This was
>interesting scientifically--says something about the log-responsiveness of
>promoters and enhancers... but I digress. Note that it is rarely correct
>to ratio against the median or mean autofluorescence--rather, as you point
>out, subtraction is superior for such a case.

I agree again.
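The channel 1024 versus channel 768 example quoted above can be checked in a few lines. This is only a sketch under the assumption implied by that example: a 1024-channel, 4-decade log scale, so 256 channels per decade. The function name is hypothetical.

```python
def fold_difference(channel_a, channel_b, n_channels=1024, decades=4.0):
    """Ratio of the linear ("scale") values behind two log-scale channels.

    On a log scale a ratio of linear values corresponds to a DIFFERENCE
    of channel numbers, so subtract the channels and convert that.
    Assumes a hypothetical 1024-channel, 4-decade scale by default.
    """
    return 10.0 ** (decades * (channel_a - channel_b) / n_channels)


# The meaningless calculation: dividing one channel number by the other.
naive_ratio = 1024 / 768
print(naive_ratio)                 # about 1.33

# The meaningful one: the ratio of the underlying linear values.
print(fold_difference(1024, 768))  # about 10
```

The same subtraction works for any pair of channels on the same scale, which is why the fold-increase between two medians is legitimate while the quotient of their channel numbers is not.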
>>If you are actually trying to compare flow data with a bulk assay of some
>>kind - for example, you have determined the total amount of fluorescent
>>label in a solution of 100,000 cells, and you now want to calibrate the
>>flow cytometric fluorescence histogram in terms of molecules of label per
>>channel - you do need to use the arithmetic mean, as Alice Givan recently
>>pointed out, and you therefore need linear data, while you usually have log
>>data.
>
>Actually, we found a very good correlation between the MEDIAN
>fluorescence of a population of cultured cells expressing b-galactosidase
>(measured by the FACS-Gal assay) and the total b-gal content of the
>population by a biochemical assay (MUG). Of course, this was because the
>populations were relatively homogeneous (clonal), with about 1-decade
>range in fluorescence--for heterogeneous expressions, the median was not a
>very good correlate of the biochemical activity.

Yes; in a relatively homogeneous population, the median will be close to the mean.

>This actually raises the most important point that everyone seems to be
>dancing around but ignoring: using any statistic is good as long as you
>justify (to yourself, and to the reviewers) that it is appropriate! In
>other words, if your cell population is homogeneous, then nearly anything
>will work. If it's not homogeneous, then you may have a lot of trouble
>with any single statistic.

Also true. Most scientists use only a few of the many statistical tests that have been developed; the statistical journals are full of specialized tests fine-tuned to various experimental situations. However, widely-used tests may be widely misused.

>While I agree that the arithmetic mean is probably going to be the closest
>for heterogeneous populations, I disagree that it should be used!
>The fact that the population cannot be effectively described by the median
>means that there is an interesting heterogeneity underlying the
>expression--and therefore it becomes a mistake to reduce the data to a
>single value.
>
>After all, this is where the power of flow cytometry is: in the
>description of the DISTRIBUTION of expression. The fact that people
>continue to take pains to reduce our gloriously rich and detailed data to
>a single number pains me to no end! Much better would be to calculate the
>10th, 25th, 50th (median), 75th, and 90th percentiles of a complex
>distribution: at least now you have 5 parameters to the distribution and
>therefore a much better chance of accurately describing it (and possibly
>discovering underlying phenomena hidden by using only a single value).

Mario makes a point I didn't make and with which I agree; one ought not to be comparing heterogeneous populations based on a single number. Culture, sorting, and gating are all appropriate to provide relatively homogeneous populations for comparison, and, when the populations being compared are heterogeneous, there is always the risk that observed differences between samples will be due to differences in proportions of the various component subpopulations. Calculating percentiles of distribution as Mario described is indeed a good way to provide something beyond a single number for purposes of comparison; I thought I mentioned that in the Data Analysis chapter of the 3rd Edition of Practical Flow Cytometry, but I couldn't find an explicit statement to that effect. I did deal with medians and interquartile ranges; these, like percentiles, are robust statistics, and I'll cover the subject in greater detail (and, I hope, with greater clarity) in the 4th Edition.

>Here's my bottom line for log distributions:
>
>"If the median is an inadequate description of the distribution, then it
>is inappropriate to reduce the distribution to a single value by any
>algorithm."
Probably a good rule of thumb.

>In such a case, using the arithmetic mean, geometric mean, Mario's mean,
>Howard's mean, or even God's mean

The "summa theologica"? The story is told that after the Flood, God told the animals to go forth and multiply. When He checked up, He found that all had done so except for a pair of snakes, who said "we can't multiply, Lord; we're adders!" So He told them to use logarithms.

> (should that actually differ from Howard's) won't be any better and is
> only throwing mud onto a beautiful painting of data.

I believe the current material of choice among the cognoscenti is elephant dung, not mud. And, as Abe Lincoln might have observed, putting a log over a log is a good start on a log cabin but it won't get you the ratio you're looking for.

-Howard
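Appended illustration of the percentile summary discussed in the exchange above (10th, 25th, 50th, 75th, and 90th percentiles). This is a minimal sketch, not anyone's published method: it uses the simple nearest-rank definition of a percentile, the function names are hypothetical, and the sample is made-up data chosen only to show how a single mean can hide a two-population distribution.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list.

    p must be an integer from 1 to 100. The rank is the ceiling of
    p * n / 100, computed with integer arithmetic.
    """
    n = len(sorted_values)
    rank = max(1, (p * n + 99) // 100)
    return sorted_values[rank - 1]


def percentile_summary(values, points=(10, 25, 50, 75, 90)):
    """Return {percentile: value} for the requested percentile points."""
    ordered = sorted(values)
    return {p: percentile(ordered, p) for p in points}


# Made-up heterogeneous sample in linear units: 60 dim events, 40 bright.
sample = [10.0] * 60 + [1000.0] * 40

mean = sum(sample) / len(sample)
summary = percentile_summary(sample)

print(mean)     # 406.0, a value that no event in the sample actually has
print(summary)  # the jump between the 50th and 75th percentiles
                # exposes the two subpopulations
```

With this made-up sample the arithmetic mean falls between the two groups, while the five percentiles show that most events are dim and a sizable minority are bright.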
This archive was generated by hypermail 2b29 : Wed Apr 03 2002 - 11:57:31 EST