Re: Contour plots and data display.

From: Ray Hicks (rayh@fcspress.com)
Date: Fri Oct 19 2001 - 04:00:26 EST
Next message: Paul Weiss: "Salary Survey Results"
Previous message: Ray Hicks: "Example of Bad Flow Data meets Contour plots and data display"
Reply: Mario Roederer: "Re: Contour plots and data display."
OK Mario,

In a reply to Howard Shapiro in September 97, you wrote:

>I must take issue with this.  Contour plots (when properly computed) do not
>inaccurately display bivariate data.  In fact, they can be much more
>informative than even color (or gray-scale) dot plots, which are much more
>difficult for most people to readily interpret.  It's not difficult to make up
>a series of test plots, shown in both formats, and demonstrate that the
>inexperienced person will more readily estimate the population frequency in a
>contour plot than a color-dot plot.  This, ultimately, is the goal in graphical
>presentation of data.

In another reply around that time:
>Contour plots are also an excellent way to convey density information.  When an
>appropriate contouring algorithm is used, such as "probability contours", then
>readers can readily guess at population frequencies, even if they are not
>experienced analyzers.  The downside of contours is that you miss rare
>populations that fall outside the last contour line.  Of course, this is easily
>remedied by displaying outlying events as dots, combining the best attributes
>of contour plots (density estimation) with dot plots (labelling every event).

and in reply to your attempt to close the thread(s) at the time, I wrote:

>I use dotplots to give me an idea of where populations lie, and I get the
>computer to count events in the regions that I set.  I don't feel that I
>need to act as a densitometer, so I don't fiddle around getting the
>"correct" contour set-up.  Nonetheless I find that I'm not more than 5-10%
>(of estimated percentage) out on most estimates that I'm asked to make at
>run time for populations in the range of 5% to 50% of total.  The proximity
>of similar dots gives an impression of increasing density, that works not
>unlike halftoning used in the print industry for representing grey scales.

So you'll remember that I'm even more reliant on tabulated data than you
appear to be, but from your paraphrase, you missed my point in the email
that you've replied to now:

> First of all, Ray makes the common mistake that bivariate plots (or
> even univariate histograms) should be used to estimate absolute cell
> counts.  This is nonsense.  If you want to know how many events are
> in a graph or a particular region, then you should express it in a
> table (or an annotation on the graphic).  One does NOT use graphs to
> tell people how many events are in a population.  For example, in how
> many publications have you counted the dots in the entire graph to
> know exactly how many events were collected?
>
> We use graphs such as bivariate plots and histograms to illustrate
> the patterns of antigen expression (or whatever else is being
> measured, like scatter, DNA, etc.).  We use them to estimate relative
> event frequencies.  (Note the term "frequency", not "count").  No
> reader or user is ever expected to determine cell counts by looking
> at a dot plot or a contour graph.

I suggested that a contour plot needs more accompanying information than
even a 1-D histogram (whereas you asserted that less information should
accompany a histogram because contour plots don't need it).  Presenting a
smoothed contour plot along with only percentage information can be quite
misleading.

The point I tried, unsuccessfully, to make is that using such contour plots,
the viewer has no idea how much data is represented.  Even one cell can give
what appears to be a well defined "peak" contoured beautifully around it, and
more cells give more such peaks, unless they're close enough together for
the contours to merge.  So while it may be possible to get an impression of
the relative frequencies from a contour plot (with which I don't agree
fully, unless the viewer is aware of the contouring method, and practised in
its interpretation - as I illustrated using the log example in my previous
mail), the smoothness can bestow undue confidence in any statistic based on
frequency generated from the data in the graph.

Say a population of 51 cells out of a sample of 500,000 makes it through a
six parameter gating strategy, and seven of those cells cluster in one
corner of a plot.  My calculator tells me that's 13.725490196 percent; let's
be "reasonable" and give three decimal places, so 13.726%.  A viewer could be
presented with a figure showing a smooth plot, which they might (truthfully)
have been told was derived from 500,000 cells, with a table that says that
there's a population with a frequency of 13.726%.  On the face of it, they've
got no way of knowing how reliable that number is.  Given a dotplot, they'd
have a fair idea of how much confidence they should put in the value (in
fact they could count the dots themselves and look up confidence levels!).
With the numbers quoted, at a 95% level, you'd expect the actual percentage
to be within 9% either side (ie 4% to 22%).  If the population were 5000
cells, you'd expect the actual percentage to be within 1% either side of 13%.

The link you gave doesn't seem to be active, but here's one to three plots
of 231 cells using FlowJo's default smoothed contour plot, unsmoothed
contour plot and dotplot.

http://www.fcspress.com/overconfident.GIF

(I would have called it "overconfident?.gif", but my server's a bit picky)

The boxed off population represents 19 cells, and the calculated percentage
is 8.23%.  The smoothed contour doesn't reflect the paucity of data, the
unsmoothed contour plot doesn't reflect the distribution of data, and the dot
plot does both, and suggests that the accuracy of the quoted percentage
is questionable.  The calculated confidence interval at a level of 95% for
this example is around +/-3.5 percentage points (which means that 95% of the
time you'd expect the measured percentage to be between 4.73 and 11.73).
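
For anyone who wants to check the arithmetic, here is a minimal Python
sketch.  It assumes the simple normal (Wald) approximation to the binomial,
which reproduces the figures quoted above but is not necessarily the method
used to obtain them, and it is known to be crude for counts this small.  The
function name and the 686-of-5000 case (about 13.7% of 5000) are
illustrative choices, not taken from the plots.

```python
import math

def wald_interval(hits, total, z=1.96):
    """Return (percentage, half_width), both in percent.

    Uses the normal approximation to the binomial; z=1.96 corresponds
    to a 95% confidence level.
    """
    if total <= 0 or not 0 <= hits <= total:
        raise ValueError("need 0 <= hits <= total and total > 0")
    p = hits / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return 100.0 * p, 100.0 * half_width

# 7 of 51 and 19 of 231 come from the examples in this message;
# 686 of 5000 is a hypothetical stand-in for "about 13% of 5000 cells".
for hits, total in [(7, 51), (686, 5000), (19, 231)]:
    pct, hw = wald_interval(hits, total)
    print(f"{hits}/{total}: {pct:.2f}% +/- {hw:.2f}")
```

The half-widths come out at roughly 9.4, 1.0 and 3.5 percentage points
respectively, which is the point: the same quoted percentage deserves very
different degrees of trust depending on how many events sit behind it.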


> Finally, the last comment by Ray is a veiled criticism of FlowJo, but
> presents me the opportunity to take a series of spectacular cheap
> shots.  I've never been able to resist that.

I'll have to point my parting shots more accurately, or wait until I've got
both feet in the stirrup.  You're right, users could reasonably expect to
have the linear gain used in scaling the axis.

> What if users change the gain between samples?
> Aren't your customers entitled to the correct statistics?

Yes they are.  Luckily they'll probably know that they've changed the gain,
and can do the arithmetic themselves until I've had a chance to amend
FCSPress so that it works in this more user-friendly fashion.  The analysis
software I've used (mainly BD) has ignored the scaling information and left
it to the user to do the arithmetic afterwards (if they've changed this in
CellQuest I didn't notice), so it didn't occur to me to make use of it.
Thank you for pointing this out.
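
The arithmetic in question is just a division by the gain, as described in
the reply quoted in full below.  A minimal Python sketch, assuming the gain
has already been read from the file's $PnG keyword (the function name is
made up for illustration, and no FCS parsing is shown):

```python
def linear_scale_value(channel, gain):
    """Convert a linear channel number to a scale value.

    Dividing by the amplifier gain means the same signal gives the same
    scale value whatever gain it was collected at.
    """
    if gain <= 0:
        raise ValueError("gain must be positive")
    return channel / gain

# Top of a 512-channel axis collected at a gain of 2.5.
print(linear_scale_value(512, 2.5))

# The same signal at two gains: channel 100 at gain 1.0 would be expected
# in channel 250 at gain 2.5, and both map to the same scale value.
print(linear_scale_value(100, 1.0), linear_scale_value(250, 2.5))
```

Statistics computed on the scale values rather than on raw channel numbers
are then comparable between samples run at different gains.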

>
> As for the event above the red line--you are absolutely correct: this
> event is, without a doubt, in fact, above the red line.  I will not
> dispute that.  Unfortunately for you, the red line happens to be
> irrelevant.  I'm not sure why you put the red line in what appears to
> be an arbitrary location in the graph!  The data for this channel is
> scaled over 3.84 decades (we calibrated our log amps, rather than
> assuming that they are 4.0), with the first channel value being 0.2.
> This means that the value at the very top of the axis is 1,383, which
> doesn't deserve its own tick mark.  So, in fact, you drew your red
> line not at the top scale of the graphic, but rather in the middle of
> where the data can still fall.  Note that FlowJo scales its graphics
> so that they fill all of the interior of the box (except for a 1
> pixel border).

As a technical aside, I notice that all four log amps used for that
acquisition have exactly the same offset (0.2000) and number of decades
(3.84). When you say calibrated, did you set them all to have the same
offset and range, or did they all just happen to have identical
responses, or did you try out a few and select the four that matched
perfectly?
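
For reference, the top-of-axis value quoted above follows from the usual
log conversion, sketched here in Python.  The sketch assumes the same 512
channels for the log parameters as for the linear ones, and the function
name is made up for illustration; only the 3.84 decades and the 0.2 first
channel value come from the message.

```python
def log_scale_value(channel, n_channels, decades, first_value):
    """Convert a log-amplified channel number to a scale value.

    The full channel range spans `decades` powers of ten, starting at
    `first_value` for channel 0.
    """
    if n_channels <= 0:
        raise ValueError("n_channels must be positive")
    return first_value * 10.0 ** (decades * channel / n_channels)

# Calibrated amplifier: 3.84 decades, first channel value 0.2.
print(round(log_scale_value(512, 512, 3.84, 0.2)))

# The same offset with an assumed, uncalibrated 4.0 decades.
print(round(log_scale_value(512, 512, 4.0, 0.2)))
```

With 3.84 decades the top of the axis falls well short of the next decade
tick, whereas assuming 4.0 decades would put it at 2000.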

>
> Therefore, it comes as no surprise that later versions of FlowJo show
> exactly the same (quite correct) graph.  Does FCSPress show its
> graphs with more space than is needed by the limits of the data?
> I've never been a fan of useless white space.

No! white space! yeuch!  Except for kinetic plots of course, where users may
want to compare data from runs of different duration, on graphs of the same
size, in which case a bit of padding at the end makes it easier. Doesn't
FlowJo do this?


> So, no, there's nothing freaky about these axes; they are just solid,
> proper, and believable displays.

Good, thanks for clearing that up.

> mr

Ray

Mario, I copied your reply in full below in case you forgot to cc it to the
list:

> Ray's discussion below illustrates how common the myths about data
> presentation are...  and also illustrates one of the sources of the
> difficulties users have: Specifically, that much of the software out there
> doesn't do a good enough job of expressing flow data (ok, not FACS data) in a
> way that doesn't allow users to make common mistakes.  (Yes, this is true of all
> software, "even" FlowJo).
>
> First of all, Ray makes the common mistake that bivariate plots (or even
> univariate histograms) should be used to estimate absolute cell counts.  This
> is nonsense.  If you want to know how many events are in a graph or a
> particular region, then you should express it in a table (or an annotation on
> the graphic).  One does NOT use graphs to tell people how many events are in a
> population.  For example, in how many publications have you counted the dots
> in the entire graph to know exactly how many events were collected?
>
> We use graphs such as bivariate plots and histograms to illustrate the
> patterns of antigen expression (or whatever else is being measured, like
> scatter, DNA, etc.).  We use them to estimate relative event frequencies.
> (Note the term "frequency", not "count").  No reader or user is ever expected
> to determine cell counts by looking at a dot plot or a contour graph.
>
> Once we understand that bivariate plots are to be used for the purposes of
> evaluating distributions (patterns), then it becomes obvious that one would
> like displays that do NOT change with event number.  After all, if I collect
> 10,000 events of one sample, and 100,000 events of another stained in exactly
> the same way, then I want to be able to compare them directly to see if the
> distributions are different.  If my displays change according to how many
> events there are, then I cannot accomplish this.  This is especially difficult
> when I compare, for example, clinical specimens where there was no choice but
> to collect different numbers of events (limited sample).  I wouldn't want my
> comparison between two specimens to be affected by the fact that one had more
> events collected than the other.
>
> This is the principal problem with dot plots--the display changes dramatically
> as more and more events are collected.  In fact, this "feature" of dot plots
> leads to a common confusion in analyzing data in other people's publications.
> For example, sometimes fewer events are collected, and then it looks like the
> "double positive" population is much more rare because there's far fewer dots
> out there.  This sort of disingenuous data presentation is made possible by
> dot plots.
>
> Ray asserts that contour plots are misleading.  Quite the opposite is true,
> especially when good density estimation algorithms are used.  An illustration
> of this can be seen at: <http://www.drmr.com/DataDisplay.html>.  Going to that
> link, you will see why contour plots (or similar types of displays) are so
> powerful:  they can be done so that they are NOT affected by cell counts and
> CAN be used to compare samples with disparate cell counts. By the way, this
> graphic illustrates that contour smoothing can be a good thing, when it's done
> properly.  Many smoothing algorithms are rightfully derided by users, because
> they can be repeatedly applied and thereby change the visualization so
> dramatically.  Many years ago, Wayne Moore and Dave Parks developed a
> relatively simple yet powerful density estimation smoothing algorithm that has
> no user control and works exceedingly well--as illustrated by the displays in
> this link.  Variations of this algorithm are used in CellQuest as well as
> statistics packages such as SAS and JMP (which have nothing to do with flow
> cytometry).
>
> In Ray's email, he also asks about log contour plots. Rather than going into
> that discussion here, there are several excellent publications written by Dave
> Parks about data display, that describe when log contour plots might be
> useful.  Given the option of doing outlier contour plots, log contour plots
> aren't as useful as they were in the early days.
>
> With FCSPress, Ray tries to get around the "saturation" problem of dot plots
> by dithering the display.  While an admirable attempt, this does not solve the
> problem.  First, it only delays the problem--as you collect more and more
> events, you will still reach saturation.  Second, while showing up nicely on
> large pieces of paper, the limited amount of display space in a journal makes
> it completely irrelevant.  The typical graphic is about 1 inch in a journal.
> At current display print resolutions of 300-600 DPI, dithering at more than
> 600 pixels per graphic becomes useless.  600 pixels per graphic is only about
> 5 times the information content of the typical 256 pixels per graphic, meaning
> that equivalent saturation is achieved with merely 5 times as many events as
> "undithered" displays.
>
> The other issue I take is; how is the collective going to select the experts?
>
> This, of course, is the heart of the problem.  Here the real solution is to
> generate a document that has requirements/suggestions/etc. for presentation,
> along with justifications for each one.  This document has to be open for
> review and comment by the community in order to develop a consensus. The
> problem is generating a consensus.  As proof, I'll bet that many people are
> already objecting to my dissertation above.  The question is, are those
> objections based on gut instinct or on sound mathematical principles?  That's
> the point of having a document that can be analyzed, reviewed, and criticized.
>
>
> Finally, the last comment by Ray is a veiled criticism of FlowJo, but presents
> me the opportunity to take a series of spectacular cheap shots.  I've never
> been able to resist that.
>
> ps as an aside, there's something freaky happening on the axes of these
> graphs - they're 512 channel data, but the linear FSC axis runs out just
> past 200, and one of the events exceeds the maximum for side scatter (ie the
> one that jumps above the red line in the left hand plots) - has this been
> fixed in later versions of FlowJo?
>
> Hm.  Well, let's think about what the numbers on the axes are, first of all.
> Are they "channel numbers"? Nope--because if we put channel numbers on the
> axis then we could never tell if it was a log scale or not.  What we put on
> the axis are "scale values".  Ray, you've sent many emails to this list to
> explain how to convert between channel number and scale fluorescence for log
> channels--it's even easier for linear channels: you just divide by the gain!
> In other words, if your instrument is set to a gain of "5", then divide the
> channel number by 5 to get fluorescence.  That way, if you collect a sample at
> two different gain settings, and do statistics on the sample, you will get the
> same fluorescence value (even though the channel values are different).
>
> So you are right that the data have 512 channels, but for this sample, the
> gain was set to 2.5 (and the $P1G keyword value is 2.5). By the FCS standard,
> this means the fluorescence at the right edge is (512/2.5), or about 205.
> Thus, the FS axis runs out just past 200.  Since you're so surprised about
> this,  I am guessing FCSPress doesn't recognize the $PnG keyword?  Wouldn't
> that be a bug users should be concerned about?  What if users change the gain
> between samples?  Aren't your customers entitled to the correct statistics?
>
> As for the event above the red line--you are absolutely correct: this event
> is, without a doubt, in fact, above the red line.  I will not dispute that.
> Unfortunately for you, the red line happens to be irrelevant.  I'm not sure
> why you put the red line in what appears to be an arbitrary location in the
> graph!  The data for this channel is scaled over 3.84 decades (we calibrated
> our log amps, rather than assuming that they are 4.0), with the first channel
> value being 0.2.  This means that the value at the very top of the axis is
> 1,383, which doesn't deserve its own tick mark. So, in fact, you drew your red
> line not at the top scale of the graphic, but rather in the middle of where
> the data can still fall. Note that FlowJo scales its graphics so that they
> fill all of the interior of the box (except for a 1 pixel border).
>
> Therefore, it comes as no surprise that later versions of FlowJo show exactly
> the same (quite correct) graph.  Does FCSPress show its graphs with more space
> than is needed by the limits of the data? I've never been a fan of useless
> white space.
>
> So, no, there's nothing freaky about these axes; they are just solid, proper,
> and believable displays.
>
> mr
>
This archive was generated by hypermail 2b29 : Sun Jan 05 2003 - 19:01:35 EST