Changeset 12326
- Timestamp:
- 07/04/2009 01:44:12 PM (14 months ago)
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
trunk/documents/notes/straightline/straightline.tex
r12319 r12326 604 604 an analysis of your prior knowledge or else be uninformative. 605 605 606 The posterior probability distribution function, in general, is the 607 likelihood times the prior, properly normalized. In this case, the 608 posterior we (tend to) care about is that for the line parameters 609 $(m,b)$. This is the full posterior probability distribution 610 marginalized over the bad-data parameters $(\allq,\Pbad,\Ybad,\Vbad)$. 611 The full posterior is 612 \begin{equation} 613 p(m,b,\allq,\Pbad,\Ybad,\Vbad|\ally,I) = 614 \frac{p(\ally|m,b,\allq,\Pbad,\Ybad,\Vbad,I)}{p(\ally|I)} 615 \,p(m,b,\allq,\Pbad,\Ybad,\Vbad|I) 616 \quad, 617 \end{equation} 618 where the numerator can---for our purposes here---be thought of as a 619 normalization integral that makes the posterior distribution function 620 integrate to unity. The marginalization looks like 621 \begin{equation} 622 p(m,b|\ally,I)=\int \d\allq\,\d\Pbad\,\d\Ybad\,\d\Vbad 623 \,p(m,b,\allq,\Pbad,\Ybad,\Vbad|\ally,I) \quad, 624 \end{equation} 625 where by the integral over $\d\allq$ we really mean a sum over all of 626 the $2^N$ possible settings of the $q_i$, and the integrals over the 627 other parameters are over their entire prior-probable domains. 628 Effectively, then this marginalization involves evaluating $2^N$ 629 different likelihoods and marginalizing each one of them over 630 $(\Pbad,\Ybad,\Vbad)$! Of course because very few points are true 631 candidates for rejection, there are many ways to ``trim the tree'' and 632 do this more quickly; the lazy can simply sample; the very lazy can 633 loop over all $2^N$ and go on vacation while it executes. 634 635 Compared to the standard line-fitting of 636 \sectionname~\ref{sec:standard}, once we imagine rejecting data, we 637 are going to have to put some thought into computational methodology; 638 the sum over $2^N$ settings should give any reader pause. We will 639 give, below, some advice about performing the marginalization and 640 computing the posterior distribution functions, which can be extremely 641 complex. We will also say a few things about Bayesian 642 methods---involving priors and marginalization---in general. But 643 first we will present an alternative to this exponentially complex 644 solution to this problem. 645 646 There is a much faster---meaning ``less computationally 647 complex''---method for modeling the outliers: Instead of requiring a 648 hard assignment $q_i$ for each data point to either the straight-line 649 or outlier populations, every data point can be modeled as being drawn 650 from a \emph{mixture} (sum with amplitudes that sum to unity) of the 651 straight-line and outlier distributions. That is, the generative 652 model the data $\ally$ is changed to 653 \begin{eqnarray}\label{eq:mixture}\displaystyle 654 \like &\equiv& \frac{p(\ally|m,b,\Pbad,\Ybad,\Vbad,I)}{p(\ally|I)} 655 \nonumber\\ 656 \like &\propto& 657 \prod_{i=1}^N \left[\frac{1-\Pbad}{\sqrt{2\,\pi\,\sigma_{yi}^2}} 658 \,\exp\left(-\frac{[y_i-m\,x_i-b]^2}{2\,\sigma_{yi}^2}\right)\right. 659 \nonumber \\ & & \quad 660 \left.+ \frac{\Pbad}{\sqrt{2\,\pi\,[\Vbad+\sigma_{yi}^2]}} 661 \,\exp\left(-\frac{[y_i-\Ybad]^2}{2\,[\Vbad+\sigma_{yi}^2]}\right)\right] 662 \quad , 663 \end{eqnarray} 664 where, once again, $\Pbad$ is the probability that a data point is bad 665 (or, more properly, the amplitude of the bad-data distribution 666 function in the mixture), and $(\Ybad,\Vbad)$ are the parameters of 667 the bad-data distribution. 668 669 Marginalization of the posterior generated by the likelihood in 670 \equationname~(\ref{eq:mixture}) does \emph{not} involve exploring an 671 exponential number of alternatives, and, again, when combined with a prior 672 on $(m,b,\Pbad,\Ybad,\Vbad)$, can be marginalized over only three 673 parameters to produce the posterior probability distribution function 674 for the straight-line parameters $(m,b)$. This scale of problem is 675 very well-suited to simple multi-dimensonal integrators, in particular 676 it works very well with a Markov-Chain Monte Carlo, which we advocate 677 below. 678 679 Whether to use the exponential binary assignment or mixture methods 680 depends---in principle---on your generative model for the data: Are 681 there really ``good'' and ``bad'' points, drawn from different 682 distributions, or are all points equivalent and drawn from a single 683 distribution with good and bad parts to it? However, this question 684 can rarely be answered (or even posed well)---in practice---in most 685 problems of interest, so we strongly advise the latter, that is, 686 fitting the distribution as a mixture of good and bad distributions, 687 where there is no exponential complexity. 688 689 % syntactical reference point 690 691 Hogg...in both we have to choose priors on $(m,b,\Pbad,\Ybad,\Vbad)$...Hogg 692 606 693 Hogg...we must discuss optimization and marginalization...Hogg 607 694 … … 612 699 necessary in all your inference. Most data analysis platforms have 613 700 optimizers that can handle problems like those presented here with 614 little or no trouble. The second piece of advice is to spend a bit of 615 thought on initialization. That is, do not start the optimizer in an 616 arbitrary point in parameter space, but rather initialize it near 617 where you think the answer is likely to be, using either your prior 618 knowledge, or else a linear fit (which provides an objective starting 619 point). The third piece of advice is to check for multiple minima by 620 randomly sampling parameter space in some significant region around 621 the optimum you have, re-converge the nonlinear optimizer, and verify 622 that you return to the same best optimum, or that if you come to a 623 different one, it has a worse value for the objective. 701 little or no trouble. The second piece of advice---especially if you 702 use a continuous optimizer, such as gradient descent---is to spend a 703 bit of thought on initialization. That is, do not start the optimizer 704 in an arbitrary point in parameter space, but rather initialize it 705 near where you think the answer is likely to be, using either your 706 prior knowledge, or else a linear fit (which provides an objective 707 starting point). The third piece of advice is to check for multiple 708 minima by randomly sampling parameter space in some significant region 709 around the optimum you have, re-converge the nonlinear optimizer, and 710 verify that you return to the same best optimum, or that if you come 711 to a different one, it has a worse value for the objective. 624 712 625 713 One attractive kind of optimizer is a sampler such as a Markov Chain … … 645 733 that optimizes the scalar objective. This method doesn't return the 646 734 strict optimum, but it does very well in most circumstances. 647 648 The posterior probability distribution is the (correctly normalized)649 product of the likelihood times the prior. If all we care about are650 the parameters $(m,b)$ of the line, we have to marginalize the651 posterior probability distribution over all choices for the binary652 integers $\allq$, all values of the prior probability $\Pbad$, and the653 parameter space $(\Ybad,\Vbad)$ of the bad-point distribution. This654 marginalization involves evaluating $2^N$ different likelihoods and655 marginalizing each one of them over $(\Pbad,\Ybad,\Vbad)$! Of course656 because very few points are true candidates for rejection, there are657 many ways to ``trim the tree'' and do this more quickly; the lazy can658 simply sample; the very lazy can loop over all $2^N$ and go on659 vacation while it executes. In any case, this bad \emph{scaling} is a660 strong mark against this method, and we will give a much faster661 alternative below.662 735 663 736 One unfortunate (though unavoidably correct) thing about any … … 696 769 or standing in as a \emph{prior} for some subsequent inference. 697 770 698 We promised a faster---less computationally complex---method for699 modeling the outliers; here it is: Instead of requiring a hard700 assignment $q_i$ for each data point to either the straight-line or701 outlier populations, every data point can be modeled as being drawn702 from a \emph{mixture} (sum with amplitudes that sum to unity) of the703 straight-line and outlier distributions. That is, the generative704 model the data $\ally$ is changed to705 \begin{eqnarray}\label{eq:mixture}\displaystyle706 \like &\equiv& \frac{p(\ally|m,b,\Pbad,\Ybad,\Vbad,I)}{p(\ally|I)}707 \nonumber\\708 \like &\propto&709 \prod_{i=1}^N \left[\frac{1-\Pbad}{\sqrt{2\,\pi\,\sigma_{yi}^2}}710 \,\exp\left(-\frac{[y_i-m\,x_i-b]^2}{2\,\sigma_{yi}^2}\right)\right.711 \nonumber \\ & & \quad712 \left.+ \frac{\Pbad}{\sqrt{2\,\pi\,[\Vbad+\sigma_{yi}^2]}}713 \,\exp\left(-\frac{[y_i-\Ybad]^2}{2\,[\Vbad+\sigma_{yi}^2]}\right)\right]714 \quad ,715 \end{eqnarray}716 where, once again, $\Pbad$ is the probability that a data point is bad717 (or, more properly, the amplitude of the bad-data distribution718 function in the mixture), and $(\Ybad,\Vbad)$ are the parameters of719 the bad-data distribution.720 721 Optimization of the likelihood in \equationname~(\ref{eq:mixture})722 does \emph{not} involve exploring an exponential number of723 alternatives, and---when combined with a prior on724 $(m,b,\Pbad,\Ybad,\Vbad)$---can be marginalized over three rather than725 $N+3$ parameters to produce a posterior for the straight-line726 parameters $(m,b)$. This scale of problem is very easy to solve with727 any simple optimizer, and in particular works very well with a728 Markov-Chain.729 730 Whether to use the binary assignment or mixture methods depends, in731 principle, on your generative model for the data: Are there really732 ``good'' and ``bad'' points, drawn from different distributions, or733 are all points equivalent and drawn from a single distribution with734 good and bad parts to it? However, this question can rarely be735 answered (or even posed well) in most problems of interest, so we736 strongly advise the latter, that is, fitting the distribution as a737 mixture of good and bad distributions, where there is no exponential738 complexity.739 740 771 \begin{comments} 741 772 Hogg...marginalization of posteriors vs likelihoods and why, 742 773 technically you have to be a Bayesian for this...Hogg 774 775 Hogg...everyone gets in a tizzy about priors as assumptions, but they 776 shouldn't...Hogg 743 777 744 778 When flagging bad data, you might not want to give all points the same … … 811 845 $\Pbad,\Ybad,\Vbad$) and plot the samples as a set of light grey or 812 846 transparent lines. 847 \end{problem} 848 849 \begin{problem}\label{prob:badfraction} 850 Solve \problemname~\ref{prob:mixture} but now plot the fully 851 marginalized (over $m,b,\Ybad,\Vbad$) posterior distribution function 852 for parameter $\Pbad$. Is this distribution peaked about where you 853 would expect, given the data? Now repeat the problem, but dividing 854 all the data uncertainty variances $\sigma_{yi}^2$ by 4 (or dividing 855 the uncertainties by 2). Again plot the fully marginalized posterior 856 distribution function for parameter $\Pbad$. Discuss. 813 857 \end{problem} 814 858
Note: See TracChangeset
for help on using the changeset viewer.
