Changeset 12326


Ignore:
Timestamp:
07/04/2009 01:44:12 PM (14 months ago)
Author:
hogg
Message:

fitting a line: the discussion following the mixture model is still a mess, but I am making progress

File:
1 edited

Legend:

Unmodified
Added
Removed
  • trunk/documents/notes/straightline/straightline.tex

    r12319 r12326  
    604604an analysis of your prior knowledge or else be uninformative. 
    605605 
     606The posterior probability distribution function, in general, is the 
     607likelihood times the prior, properly normalized.  In this case, the 
     608posterior we (tend to) care about is that for the line parameters 
     609$(m,b)$.  This is the full posterior probability distribution 
     610marginalized over the bad-data parameters $(\allq,\Pbad,\Ybad,\Vbad)$. 
     611The full posterior is 
     612\begin{equation} 
     613p(m,b,\allq,\Pbad,\Ybad,\Vbad|\ally,I) = 
     614 \frac{p(\ally|m,b,\allq,\Pbad,\Ybad,\Vbad,I)}{p(\ally|I)} 
     615 \,p(m,b,\allq,\Pbad,\Ybad,\Vbad|I) 
     616 \quad, 
     617\end{equation} 
     618where the numerator can---for our purposes here---be thought of as a 
     619normalization integral that makes the posterior distribution function 
     620integrate to unity.  The marginalization looks like 
     621\begin{equation} 
     622p(m,b|\ally,I)=\int \d\allq\,\d\Pbad\,\d\Ybad\,\d\Vbad 
     623 \,p(m,b,\allq,\Pbad,\Ybad,\Vbad|\ally,I) \quad, 
     624\end{equation} 
     625where by the integral over $\d\allq$ we really mean a sum over all of 
     626the $2^N$ possible settings of the $q_i$, and the integrals over the 
     627other parameters are over their entire prior-probable domains. 
     628Effectively, then this marginalization involves evaluating $2^N$ 
     629different likelihoods and marginalizing each one of them over 
     630$(\Pbad,\Ybad,\Vbad)$!  Of course because very few points are true 
     631candidates for rejection, there are many ways to ``trim the tree'' and 
     632do this more quickly; the lazy can simply sample; the very lazy can 
     633loop over all $2^N$ and go on vacation while it executes. 
     634   
     635Compared to the standard line-fitting of 
     636\sectionname~\ref{sec:standard}, once we imagine rejecting data, we 
     637are going to have to put some thought into computational methodology; 
     638the sum over $2^N$ settings should give any reader pause.  We will 
     639give, below, some advice about performing the marginalization and 
     640computing the posterior distribution functions, which can be extremely 
     641complex.  We will also say a few things about Bayesian 
     642methods---involving priors and marginalization---in general.  But 
     643first we will present an alternative to this exponentially complex 
     644solution to this problem. 
     645 
     646There is a much faster---meaning ``less computationally 
     647complex''---method for modeling the outliers: Instead of requiring a 
     648hard assignment $q_i$ for each data point to either the straight-line 
     649or outlier populations, every data point can be modeled as being drawn 
     650from a \emph{mixture} (sum with amplitudes that sum to unity) of the 
     651straight-line and outlier distributions.  That is, the generative 
     652model the data $\ally$ is changed to 
     653\begin{eqnarray}\label{eq:mixture}\displaystyle 
     654\like &\equiv& \frac{p(\ally|m,b,\Pbad,\Ybad,\Vbad,I)}{p(\ally|I)} 
     655 \nonumber\\ 
     656\like &\propto& 
     657 \prod_{i=1}^N \left[\frac{1-\Pbad}{\sqrt{2\,\pi\,\sigma_{yi}^2}} 
     658 \,\exp\left(-\frac{[y_i-m\,x_i-b]^2}{2\,\sigma_{yi}^2}\right)\right. 
     659 \nonumber \\ & & \quad 
     660 \left.+ \frac{\Pbad}{\sqrt{2\,\pi\,[\Vbad+\sigma_{yi}^2]}} 
     661 \,\exp\left(-\frac{[y_i-\Ybad]^2}{2\,[\Vbad+\sigma_{yi}^2]}\right)\right] 
     662 \quad , 
     663\end{eqnarray} 
     664where, once again, $\Pbad$ is the probability that a data point is bad 
     665(or, more properly, the amplitude of the bad-data distribution 
     666function in the mixture), and $(\Ybad,\Vbad)$ are the parameters of 
     667the bad-data distribution. 
     668 
     669Marginalization of the posterior generated by the likelihood in 
     670\equationname~(\ref{eq:mixture}) does \emph{not} involve exploring an 
     671exponential number of alternatives, and, again, when combined with a prior 
     672on $(m,b,\Pbad,\Ybad,\Vbad)$, can be marginalized over only three 
     673parameters to produce the posterior probability distribution function 
     674for the straight-line parameters $(m,b)$.  This scale of problem is 
     675very well-suited to simple multi-dimensonal integrators, in particular 
     676it works very well with a Markov-Chain Monte Carlo, which we advocate 
     677below. 
     678 
     679Whether to use the exponential binary assignment or mixture methods 
     680depends---in principle---on your generative model for the data: Are 
     681there really ``good'' and ``bad'' points, drawn from different 
     682distributions, or are all points equivalent and drawn from a single 
     683distribution with good and bad parts to it?  However, this question 
     684can rarely be answered (or even posed well)---in practice---in most 
     685problems of interest, so we strongly advise the latter, that is, 
     686fitting the distribution as a mixture of good and bad distributions, 
     687where there is no exponential complexity. 
     688 
     689% syntactical reference point 
     690 
     691Hogg...in both we have to choose priors on $(m,b,\Pbad,\Ybad,\Vbad)$...Hogg 
     692 
    606693Hogg...we must discuss optimization and marginalization...Hogg 
    607694 
     
    612699necessary in all your inference.  Most data analysis platforms have 
    613700optimizers that can handle problems like those presented here with 
    614 little or no trouble.  The second piece of advice is to spend a bit of 
    615 thought on initialization.  That is, do not start the optimizer in an 
    616 arbitrary point in parameter space, but rather initialize it near 
    617 where you think the answer is likely to be, using either your prior 
    618 knowledge, or else a linear fit (which provides an objective starting 
    619 point).  The third piece of advice is to check for multiple minima by 
    620 randomly sampling parameter space in some significant region around 
    621 the optimum you have, re-converge the nonlinear optimizer, and verify 
    622 that you return to the same best optimum, or that if you come to a 
    623 different one, it has a worse value for the objective. 
     701little or no trouble.  The second piece of advice---especially if you 
     702use a continuous optimizer, such as gradient descent---is to spend a 
     703bit of thought on initialization.  That is, do not start the optimizer 
     704in an arbitrary point in parameter space, but rather initialize it 
     705near where you think the answer is likely to be, using either your 
     706prior knowledge, or else a linear fit (which provides an objective 
     707starting point).  The third piece of advice is to check for multiple 
     708minima by randomly sampling parameter space in some significant region 
     709around the optimum you have, re-converge the nonlinear optimizer, and 
     710verify that you return to the same best optimum, or that if you come 
     711to a different one, it has a worse value for the objective. 
    624712 
    625713One attractive kind of optimizer is a sampler such as a Markov Chain 
     
    645733that optimizes the scalar objective.  This method doesn't return the 
    646734strict optimum, but it does very well in most circumstances. 
    647  
    648 The posterior probability distribution is the (correctly normalized) 
    649 product of the likelihood times the prior.  If all we care about are 
    650 the parameters $(m,b)$ of the line, we have to marginalize the 
    651 posterior probability distribution over all choices for the binary 
    652 integers $\allq$, all values of the prior probability $\Pbad$, and the 
    653 parameter space $(\Ybad,\Vbad)$ of the bad-point distribution.  This 
    654 marginalization involves evaluating $2^N$ different likelihoods and 
    655 marginalizing each one of them over $(\Pbad,\Ybad,\Vbad)$!  Of course 
    656 because very few points are true candidates for rejection, there are 
    657 many ways to ``trim the tree'' and do this more quickly; the lazy can 
    658 simply sample; the very lazy can loop over all $2^N$ and go on 
    659 vacation while it executes.  In any case, this bad \emph{scaling} is a 
    660 strong mark against this method, and we will give a much faster 
    661 alternative below. 
    662735 
    663736One unfortunate (though unavoidably correct) thing about any 
     
    696769or standing in as a \emph{prior} for some subsequent inference. 
    697770 
    698 We promised a faster---less computationally complex---method for 
    699 modeling the outliers; here it is: Instead of requiring a hard 
    700 assignment $q_i$ for each data point to either the straight-line or 
    701 outlier populations, every data point can be modeled as being drawn 
    702 from a \emph{mixture} (sum with amplitudes that sum to unity) of the 
    703 straight-line and outlier distributions.  That is, the generative 
    704 model the data $\ally$ is changed to 
    705 \begin{eqnarray}\label{eq:mixture}\displaystyle 
    706 \like &\equiv& \frac{p(\ally|m,b,\Pbad,\Ybad,\Vbad,I)}{p(\ally|I)} 
    707  \nonumber\\ 
    708 \like &\propto& 
    709  \prod_{i=1}^N \left[\frac{1-\Pbad}{\sqrt{2\,\pi\,\sigma_{yi}^2}} 
    710  \,\exp\left(-\frac{[y_i-m\,x_i-b]^2}{2\,\sigma_{yi}^2}\right)\right. 
    711  \nonumber \\ & & \quad 
    712  \left.+ \frac{\Pbad}{\sqrt{2\,\pi\,[\Vbad+\sigma_{yi}^2]}} 
    713  \,\exp\left(-\frac{[y_i-\Ybad]^2}{2\,[\Vbad+\sigma_{yi}^2]}\right)\right] 
    714  \quad , 
    715 \end{eqnarray} 
    716 where, once again, $\Pbad$ is the probability that a data point is bad 
    717 (or, more properly, the amplitude of the bad-data distribution 
    718 function in the mixture), and $(\Ybad,\Vbad)$ are the parameters of 
    719 the bad-data distribution. 
    720  
    721 Optimization of the likelihood in \equationname~(\ref{eq:mixture}) 
    722 does \emph{not} involve exploring an exponential number of 
    723 alternatives, and---when combined with a prior on 
    724 $(m,b,\Pbad,\Ybad,\Vbad)$---can be marginalized over three rather than 
    725 $N+3$ parameters to produce a posterior for the straight-line 
    726 parameters $(m,b)$.  This scale of problem is very easy to solve with 
    727 any simple optimizer, and in particular works very well with a 
    728 Markov-Chain. 
    729  
    730 Whether to use the binary assignment or mixture methods depends, in 
    731 principle, on your generative model for the data: Are there really 
    732 ``good'' and ``bad'' points, drawn from different distributions, or 
    733 are all points equivalent and drawn from a single distribution with 
    734 good and bad parts to it?  However, this question can rarely be 
    735 answered (or even posed well) in most problems of interest, so we 
    736 strongly advise the latter, that is, fitting the distribution as a 
    737 mixture of good and bad distributions, where there is no exponential 
    738 complexity. 
    739  
    740771\begin{comments} 
    741772Hogg...marginalization of posteriors vs likelihoods and why, 
    742773technically you have to be a Bayesian for this...Hogg 
     774 
     775Hogg...everyone gets in a tizzy about priors as assumptions, but they 
     776shouldn't...Hogg 
    743777 
    744778When flagging bad data, you might not want to give all points the same 
     
    811845$\Pbad,\Ybad,\Vbad$) and plot the samples as a set of light grey or 
    812846transparent lines. 
     847\end{problem} 
     848 
     849\begin{problem}\label{prob:badfraction} 
     850Solve \problemname~\ref{prob:mixture} but now plot the fully 
     851marginalized (over $m,b,\Ybad,\Vbad$) posterior distribution function 
     852for parameter $\Pbad$.  Is this distribution peaked about where you 
     853would expect, given the data?  Now repeat the problem, but dividing 
     854all the data uncertainty variances $\sigma_{yi}^2$ by 4 (or dividing 
     855the uncertainties by 2).  Again plot the fully marginalized posterior 
     856distribution function for parameter $\Pbad$.  Discuss. 
    813857\end{problem} 
    814858 
Note: See TracChangeset for help on using the changeset viewer.