From: A.J. Rossini (rossini@blindglobe.net)
Date: Wed Apr 16 2003 - 15:37:13 EST
Mario Roederer <roederer@drmr.com> writes:
> A couple of years ago, we published a method ("Probability Binning"),
> which provides a statistic that is far more useful (biologically) than
> that provided by the KS algorithm. The output of this method
> ("T(chi)" or T(x)) is very similar to a student's t-score--i.e., it is
> essentially the number of standard deviations away from a result
> expected solely by chance. This metric is normalized by the number of
> events, and shows much more tolerance than KS. Note that for simple
> distribution comparisons the T(x) is highly related to the KS "D"
> value.
>
> The T(x) values can be compared between samples to rank them (for
> example, how different from a control sample). The other advantage of
> the PB method is that it has been generalized to "n" dimensions--KS
> only works on one-dimensional (univariate) data; PB can compare, for
> example, multivariate staining patterns to detect subtle differences.
> Finally, the PB algorithm can be extended to not only tell you what is
> different between two samples, but to identify the regions in
> multivariate space that account for those differences--e.g., searching
> out the regions in immunophenotyping space that make two samples
> different.
>
> The PB algorithms were published in a series of articles in Cytometry
> in volume 45, pages 37-64. These algorithms have been implemented in
> FlowJo as published.
We implemented them in RFlowCyt (not really ready for prime time, and
perhaps never for production, but nice for modelling and exploring
statistical approaches for flow data) and played with them last year.
For the most part they are decent, especially if you use Keith
Baggerly's suggested corrections. Scalability problems still arise,
especially if one wants to draw sound inferences (Keith's paper is
slightly misleading here, since it limits its simulations to small
datasets). I'm studying a few alternatives that we are playing with;
so far PB is "equivalent" to the alternative learning approaches, in
the sense of neither dominating nor being dominated by them across a
range of situations (work in progress). So, as an exploratory
statistical approach, probability binning looks pretty good, and
decision making based on it should work reasonably well for an
experienced data analyst.
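
For readers who have not seen the method, here is a minimal univariate
sketch in Python of the kind of score described in the quoted message:
bin the control sample so each bin holds the same number of events,
compare bin fractions with a chi-square-like distance, and report how
many standard deviations the result lies above what chance alone would
give. The function names are hypothetical, and the normalization (mean
B/n and SD sqrt(B)/n for B bins, with n the smaller event count) is an
assumption about the published formula, not something stated in this
thread. Baggerly's corrections are not included, and this is not the
RFlowCyt or FlowJo code.

```python
import bisect
import math
import random


def equal_count_edges(control, n_bins):
    """Interior bin edges giving (nearly) equal control counts per bin."""
    ordered = sorted(control)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def bin_fractions(values, edges):
    """Fraction of events in each of the len(edges) + 1 bins."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


def pb_statistic(control, sample, n_bins=20):
    """Univariate T(chi)-style score: SDs above the chance expectation."""
    edges = equal_count_edges(control, n_bins)
    c = bin_fractions(control, edges)
    s = bin_fractions(sample, edges)
    # Chi-square-like distance on bin fractions (not raw counts).
    chi2 = sum((ci - si) ** 2 / (ci + si)
               for ci, si in zip(c, s) if ci + si > 0)
    # ASSUMED normalization: mean B/n and SD sqrt(B)/n, n = smaller sample.
    n = min(len(control), len(sample))
    expected = n_bins / n
    sd = math.sqrt(n_bins) / n
    return max(0.0, (chi2 - expected) / sd)


# Simulated data, for illustration only.
rng = random.Random(0)
control = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(0.5, 1.0) for _ in range(5000)]
print("same distribution:", pb_statistic(control, same))
print("shifted by 0.5 SD:", pb_statistic(control, shifted))
```

Scores computed this way can be ranked against a common control, which
is the exploratory use discussed above; they are not p-values.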
The trick appears to be moving from exploratory statistics to real
inferential statistics, i.e. getting p-values that make sense. This is
a standard problem for inference (hypothesis testing, confidence
intervals, decision making) with large datasets, and it is always a
bit tricky with data-adaptive statistical procedures (which most
machine-learning procedures tend to be).
best,
-tony
--
A.J. Rossini rossini@u.washington.edu http://software.biostat.washington.edu/
Biostatistics, U Washington and Fred Hutchinson Cancer Research Center
FHCRC:Tu: 206-667-7025 (fax=4812)|Voicemail is pretty sketchy/use Email
UW : Th: 206-543-1044 (fax=3286)|Change last 4 digits of phone to FAX