| | 606 | The posterior probability distribution function, in general, is the |
| | 607 | likelihood times the prior, properly normalized. In this case, the |
| | 608 | posterior we (tend to) care about is that for the line parameters |
| | 609 | $(m,b)$. This is the full posterior probability distribution |
| | 610 | marginalized over the bad-data parameters $(\allq,\Pbad,\Ybad,\Vbad)$. |
| | 611 | The full posterior is |
| | 612 | \begin{equation} |
| | 613 | p(m,b,\allq,\Pbad,\Ybad,\Vbad|\ally,I) = |
| | 614 | \frac{p(\ally|m,b,\allq,\Pbad,\Ybad,\Vbad,I)}{p(\ally|I)} |
| | 615 | \,p(m,b,\allq,\Pbad,\Ybad,\Vbad|I) |
| | 616 | \quad, |
| | 617 | \end{equation} |
| | 618 | where the numerator can---for our purposes here---be thought of as a |
| | 619 | normalization integral that makes the posterior distribution function |
| | 620 | integrate to unity. The marginalization looks like |
| | 621 | \begin{equation} |
| | 622 | p(m,b|\ally,I)=\int \d\allq\,\d\Pbad\,\d\Ybad\,\d\Vbad |
| | 623 | \,p(m,b,\allq,\Pbad,\Ybad,\Vbad|\ally,I) \quad, |
| | 624 | \end{equation} |
| | 625 | where by the integral over $\d\allq$ we really mean a sum over all of |
| | 626 | the $2^N$ possible settings of the $q_i$, and the integrals over the |
| | 627 | other parameters are over their entire prior-probable domains. |
| | 628 | Effectively, then this marginalization involves evaluating $2^N$ |
| | 629 | different likelihoods and marginalizing each one of them over |
| | 630 | $(\Pbad,\Ybad,\Vbad)$! Of course because very few points are true |
| | 631 | candidates for rejection, there are many ways to ``trim the tree'' and |
| | 632 | do this more quickly; the lazy can simply sample; the very lazy can |
| | 633 | loop over all $2^N$ and go on vacation while it executes. |
| | 634 | |
| | 635 | Compared to the standard line-fitting of |
| | 636 | \sectionname~\ref{sec:standard}, once we imagine rejecting data, we |
| | 637 | are going to have to put some thought into computational methodology; |
| | 638 | the sum over $2^N$ settings should give any reader pause. We will |
| | 639 | give, below, some advice about performing the marginalization and |
| | 640 | computing the posterior distribution functions, which can be extremely |
| | 641 | complex. We will also say a few things about Bayesian |
| | 642 | methods---involving priors and marginalization---in general. But |
| | 643 | first we will present an alternative to this exponentially complex |
| | 644 | solution to this problem. |
| | 645 | |
| | 646 | There is a much faster---meaning ``less computationally |
| | 647 | complex''---method for modeling the outliers: Instead of requiring a |
| | 648 | hard assignment $q_i$ for each data point to either the straight-line |
| | 649 | or outlier populations, every data point can be modeled as being drawn |
| | 650 | from a \emph{mixture} (sum with amplitudes that sum to unity) of the |
| | 651 | straight-line and outlier distributions. That is, the generative |
| | 652 | model the data $\ally$ is changed to |
| | 653 | \begin{eqnarray}\label{eq:mixture}\displaystyle |
| | 654 | \like &\equiv& \frac{p(\ally|m,b,\Pbad,\Ybad,\Vbad,I)}{p(\ally|I)} |
| | 655 | \nonumber\\ |
| | 656 | \like &\propto& |
| | 657 | \prod_{i=1}^N \left[\frac{1-\Pbad}{\sqrt{2\,\pi\,\sigma_{yi}^2}} |
| | 658 | \,\exp\left(-\frac{[y_i-m\,x_i-b]^2}{2\,\sigma_{yi}^2}\right)\right. |
| | 659 | \nonumber \\ & & \quad |
| | 660 | \left.+ \frac{\Pbad}{\sqrt{2\,\pi\,[\Vbad+\sigma_{yi}^2]}} |
| | 661 | \,\exp\left(-\frac{[y_i-\Ybad]^2}{2\,[\Vbad+\sigma_{yi}^2]}\right)\right] |
| | 662 | \quad , |
| | 663 | \end{eqnarray} |
| | 664 | where, once again, $\Pbad$ is the probability that a data point is bad |
| | 665 | (or, more properly, the amplitude of the bad-data distribution |
| | 666 | function in the mixture), and $(\Ybad,\Vbad)$ are the parameters of |
| | 667 | the bad-data distribution. |
| | 668 | |
| | 669 | Marginalization of the posterior generated by the likelihood in |
| | 670 | \equationname~(\ref{eq:mixture}) does \emph{not} involve exploring an |
| | 671 | exponential number of alternatives, and, again, when combined with a prior |
| | 672 | on $(m,b,\Pbad,\Ybad,\Vbad)$, can be marginalized over only three |
| | 673 | parameters to produce the posterior probability distribution function |
| | 674 | for the straight-line parameters $(m,b)$. This scale of problem is |
| | 675 | very well-suited to simple multi-dimensonal integrators, in particular |
| | 676 | it works very well with a Markov-Chain Monte Carlo, which we advocate |
| | 677 | below. |
| | 678 | |
| | 679 | Whether to use the exponential binary assignment or mixture methods |
| | 680 | depends---in principle---on your generative model for the data: Are |
| | 681 | there really ``good'' and ``bad'' points, drawn from different |
| | 682 | distributions, or are all points equivalent and drawn from a single |
| | 683 | distribution with good and bad parts to it? However, this question |
| | 684 | can rarely be answered (or even posed well)---in practice---in most |
| | 685 | problems of interest, so we strongly advise the latter, that is, |
| | 686 | fitting the distribution as a mixture of good and bad distributions, |
| | 687 | where there is no exponential complexity. |
| | 688 | |
| | 689 | % syntactical reference point |
| | 690 | |
| | 691 | Hogg...in both we have to choose priors on $(m,b,\Pbad,\Ybad,\Vbad)$...Hogg |
| | 692 | |
| 614 | | little or no trouble. The second piece of advice is to spend a bit of |
| 615 | | thought on initialization. That is, do not start the optimizer in an |
| 616 | | arbitrary point in parameter space, but rather initialize it near |
| 617 | | where you think the answer is likely to be, using either your prior |
| 618 | | knowledge, or else a linear fit (which provides an objective starting |
| 619 | | point). The third piece of advice is to check for multiple minima by |
| 620 | | randomly sampling parameter space in some significant region around |
| 621 | | the optimum you have, re-converge the nonlinear optimizer, and verify |
| 622 | | that you return to the same best optimum, or that if you come to a |
| 623 | | different one, it has a worse value for the objective. |
| | 701 | little or no trouble. The second piece of advice---especially if you |
| | 702 | use a continuous optimizer, such as gradient descent---is to spend a |
| | 703 | bit of thought on initialization. That is, do not start the optimizer |
| | 704 | in an arbitrary point in parameter space, but rather initialize it |
| | 705 | near where you think the answer is likely to be, using either your |
| | 706 | prior knowledge, or else a linear fit (which provides an objective |
| | 707 | starting point). The third piece of advice is to check for multiple |
| | 708 | minima by randomly sampling parameter space in some significant region |
| | 709 | around the optimum you have, re-converge the nonlinear optimizer, and |
| | 710 | verify that you return to the same best optimum, or that if you come |
| | 711 | to a different one, it has a worse value for the objective. |
| 698 | | We promised a faster---less computationally complex---method for |
| 699 | | modeling the outliers; here it is: Instead of requiring a hard |
| 700 | | assignment $q_i$ for each data point to either the straight-line or |
| 701 | | outlier populations, every data point can be modeled as being drawn |
| 702 | | from a \emph{mixture} (sum with amplitudes that sum to unity) of the |
| 703 | | straight-line and outlier distributions. That is, the generative |
| 704 | | model the data $\ally$ is changed to |
| 705 | | \begin{eqnarray}\label{eq:mixture}\displaystyle |
| 706 | | \like &\equiv& \frac{p(\ally|m,b,\Pbad,\Ybad,\Vbad,I)}{p(\ally|I)} |
| 707 | | \nonumber\\ |
| 708 | | \like &\propto& |
| 709 | | \prod_{i=1}^N \left[\frac{1-\Pbad}{\sqrt{2\,\pi\,\sigma_{yi}^2}} |
| 710 | | \,\exp\left(-\frac{[y_i-m\,x_i-b]^2}{2\,\sigma_{yi}^2}\right)\right. |
| 711 | | \nonumber \\ & & \quad |
| 712 | | \left.+ \frac{\Pbad}{\sqrt{2\,\pi\,[\Vbad+\sigma_{yi}^2]}} |
| 713 | | \,\exp\left(-\frac{[y_i-\Ybad]^2}{2\,[\Vbad+\sigma_{yi}^2]}\right)\right] |
| 714 | | \quad , |
| 715 | | \end{eqnarray} |
| 716 | | where, once again, $\Pbad$ is the probability that a data point is bad |
| 717 | | (or, more properly, the amplitude of the bad-data distribution |
| 718 | | function in the mixture), and $(\Ybad,\Vbad)$ are the parameters of |
| 719 | | the bad-data distribution. |
| 720 | | |
| 721 | | Optimization of the likelihood in \equationname~(\ref{eq:mixture}) |
| 722 | | does \emph{not} involve exploring an exponential number of |
| 723 | | alternatives, and---when combined with a prior on |
| 724 | | $(m,b,\Pbad,\Ybad,\Vbad)$---can be marginalized over three rather than |
| 725 | | $N+3$ parameters to produce a posterior for the straight-line |
| 726 | | parameters $(m,b)$. This scale of problem is very easy to solve with |
| 727 | | any simple optimizer, and in particular works very well with a |
| 728 | | Markov-Chain. |
| 729 | | |
| 730 | | Whether to use the binary assignment or mixture methods depends, in |
| 731 | | principle, on your generative model for the data: Are there really |
| 732 | | ``good'' and ``bad'' points, drawn from different distributions, or |
| 733 | | are all points equivalent and drawn from a single distribution with |
| 734 | | good and bad parts to it? However, this question can rarely be |
| 735 | | answered (or even posed well) in most problems of interest, so we |
| 736 | | strongly advise the latter, that is, fitting the distribution as a |
| 737 | | mixture of good and bad distributions, where there is no exponential |
| 738 | | complexity. |
| 739 | | |