\documentclass{article}
\usepackage[margin=1in]{geometry}
\usepackage{amsmath, amsfonts}
\usepackage{enumerate}
\usepackage{graphicx}
\usepackage{titling}
\usepackage{url}
\usepackage{xcolor}
\usepackage[colorlinks=true,urlcolor=blue]{hyperref}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Commands for customizing the assignment %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand \duedate {5 p.m. Wednesday, January 28, 2015}
\title{
10-601 Machine Learning: Homework 2\\
\vspace{0.2cm}
\large{
Due \duedate{}
}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Useful commands for typesetting the questions %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand \expect {\mathbb{E}}
\newcommand \mle [1]{{\hat #1}^{\rm MLE}}
\newcommand \map [1]{{\hat #1}^{\rm MAP}}
\newcommand \argmax {\operatorname*{argmax}}
\newcommand \argmin {\operatorname*{argmin}}
\newcommand \code [1]{{\tt #1}}
\newcommand \datacount [1]{\#\{#1\}}
\newcommand \ind [1]{\mathbb{I}\{#1\}}
%%%%%%%%%%%%%%%%%%%%%%%%%%
% Document configuration %
%%%%%%%%%%%%%%%%%%%%%%%%%%
% Don't display a date in the title and remove the white space
\predate{}
\postdate{}
\date{}
% Don't display an author and remove the white space
\preauthor{}
\postauthor{}
\author{}
%%%%%%%%%%%%%%%%%%
% Begin Document %
%%%%%%%%%%%%%%%%%%
\begin{document}
\maketitle
\section*{Instructions}
\begin{itemize}
\item {\bf Late homework policy:} Homework is worth full credit if
submitted before the due date, half credit during the next 48 hours,
and zero credit after that. You {\em must} turn in at least $n-1$
of the $n$ homeworks to pass the class, even if for zero credit.
\item {\bf Collaboration policy:} Homeworks must be done individually,
except where otherwise noted in the assignments. ``Individually''
means each student must hand in their own answers, and each student
must write and use their own code in the programming parts of the
assignment. It is acceptable for students to collaborate in figuring
out answers and to help each other solve the problems, though you
must in the end write up your own solutions individually, and you
must list the names of students you discussed this with. We will be
assuming that, as participants in a graduate course, you will be
taking the responsibility to make sure you personally understand the
solution to any work arising from such collaboration.
\item {\bf Online submission:} You must submit your solutions online
on
\href{https://autolab.cs.cmu.edu/courses/27/assessments/97}{autolab}.
We recommend that you use \LaTeX{}, but we will accept scanned
solutions as well. On the Homework 2 autolab page, you can download
the
\href{https://autolab.cs.cmu.edu/courses/27/assessments/97/attachments}{submission
template}, which is a tar archive containing blank placeholder
pdfs for each of the three problems. Replace each of these pdf files
with your solutions for the corresponding problem, create a new tar
archive of the top-level directory, and submit your archived
solutions online by clicking the ``Submit File'' button. You should
submit a single tar archive identical to the template, except with
each of the problem pdfs replaced with your solutions.
\textbf{\emph{DO NOT}} change the name of any of the files or
folders in the submission template. In other words, your submitted
pdfs should have exactly the same names as those in the submission
template. Do not modify the directory structure.
\end{itemize}
\section*{Problem 1: More Probability Review}
\begin{enumerate}[(a)]
\item {\bf [4 Points]} For events $A$ and $B$, prove
\[
P(A|B) = \frac{P(B|A)P(A)}{P(B)}.
\]
\item {\bf [4 Points]} For events $A$, $B$, and $C$, rewrite
$P(A,B,C)$ as a \emph{product} of several conditional probabilities
and one unconditional probability involving a single event. Your
conditional probabilities can use only one event on the left side of
the conditioning bar. For example, $P(A|C)$ and $P(A)$ would be
okay, but $P(A,B|C)$ is not.
\item {\bf [4 Points]} Let $A$ be any event, and let $X$ be a random
variable defined by
\[
X = \begin{cases}
1 & \hbox{if event $A$ occurs} \\
0 & \hbox{otherwise}.
\end{cases}
\]
$X$ is sometimes called the indicator random variable for the event
$A$. Show that $\expect[X] = P(A)$, where $\expect[X]$ denotes the
{\em expected value} of $X$.
\item Let $X$, $Y$, and $Z$ be random variables taking values in
$\{0,1\}$. The following table lists the probability of each
possible assignment of $0$ and $1$ to the variables $X$, $Y$, and
$Z$:
\begin{center}
\begin{tabular}{|l|c|c|c|c|}
\hline
&\multicolumn{2}{c|}{$Z=0$} & \multicolumn{2}{c|}{$Z=1$}\\\hline
& $X = 0$ & $X = 1$ & $X = 0$ & $X = 1$\\\hline
$Y = 0$ & $1/15$ & $1/15$ & $4/15$ & $2/15$ \\
$Y = 1$ & $1/10$ & $1/10$ & $8/45$ & $4/45$ \\
\hline
\end{tabular}
\end{center}
For example, $P(X = 0, Y = 1, Z = 0) = 1/10$ and
$P(X=1,Y=1,Z=1) = 4/45$.
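Before working through parts (i)--(iii), it may help to confirm that the table is a valid joint distribution. A minimal Python sketch (the tuple key order $(x,y,z)$ is our own convention, not part of the assignment); exact rational arithmetic avoids any rounding issues:

```python
# Sanity check (not part of the assignment): verify that the table
# above is a valid joint distribution, i.e. its entries sum to 1.
from fractions import Fraction as F

# joint[(x, y, z)] = P(X = x, Y = y, Z = z), transcribed from the table
joint = {
    (0, 0, 0): F(1, 15), (1, 0, 0): F(1, 15),
    (0, 1, 0): F(1, 10), (1, 1, 0): F(1, 10),
    (0, 0, 1): F(4, 15), (1, 0, 1): F(2, 15),
    (0, 1, 1): F(8, 45), (1, 1, 1): F(4, 45),
}

assert sum(joint.values()) == 1  # probabilities sum to exactly 1
```

The same dictionary can then be summed over subsets of its keys to obtain whatever marginal or conditional probabilities the questions require.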
\begin{enumerate}[(i)]
\item {\bf [4 Points]} Is $X$ independent of $Y$? Why or why not?
\item {\bf [4 Points]} Is $X$ conditionally independent of $Y$ given
$Z$? Why or why not?
\item {\bf [4 Points]} Calculate $P(X = 0 | X+Y > 0)$.
\end{enumerate}
\end{enumerate}
\section*{Problem 2: Maximum Likelihood and Maximum a Posteriori
Estimation}
This problem explores two different techniques for estimating an
unknown parameter of a probability distribution: the maximum
likelihood estimate (MLE) and the maximum a posteriori probability
(MAP) estimate.

Suppose we observe the values of $n$ iid\footnote{iid means
independent and identically distributed.} random variables $X_1$,
\dots, $X_n$ drawn from a single Bernoulli distribution with parameter
$\theta$. In other words, for each $X_i$, we know that
\[
P(X_i = 1) = \theta \quad \hbox{and} \quad P(X_i = 0) = 1 - \theta.
\]
Our goal is to estimate the value of $\theta$ from these observed
values of $X_1$ through $X_n$.
\subsection*{Maximum Likelihood Estimation}
The first estimator of $\theta$ that we consider is the maximum
likelihood estimator. For any hypothetical value $\hat \theta$, we can
compute the probability of observing the outcome $X_1$, \dots, $X_n$
if the true parameter value $\theta$ were equal to $\hat \theta$.
This probability of the observed data is often called the {\em data
likelihood}, and the function $L(\hat \theta)$ that maps each
$\hat \theta$ to the corresponding likelihood is called the
\emph{likelihood function}. A natural way to estimate the unknown
parameter $\theta$ is to choose the $\hat \theta$ that maximizes the
likelihood function. Formally,
\[
\mle{\theta} = \argmax_{\hat \theta} L(\hat \theta).
\]
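The parts below ask you to evaluate a function on a grid of candidate values and find its maximizer. These mechanics can be sketched generically; the function $g$ here is an arbitrary placeholder (it is \emph{not} the likelihood you are asked to derive):

```python
# A generic sketch of the grid-search mechanics used in the parts
# below: evaluate a function on a grid of candidate values and pick
# the maximizer. The function g is an arbitrary placeholder -- it is
# NOT the likelihood function asked for in part (a).
def grid_argmax(g, grid):
    """Return the grid point at which g is largest."""
    return max(grid, key=g)

grid = [i / 100 for i in range(101)]   # theta-hat in {0, 0.01, ..., 1.0}

def g(t):                              # placeholder function, peaks at 1/2
    return t * (1.0 - t)

best = grid_argmax(g, grid)
assert abs(best - 0.5) < 1e-9
```

For the plots themselves, any plotting tool will do; substituting your likelihood for $g$ and plotting $g$ over the same grid gives the figures requested below.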
\begin{enumerate}[(a)]
\item {\bf [4 Points]} Write a formula for the likelihood function,
$L(\hat \theta)$. Your function should depend on the random
variables $X_1$, \dots, $X_n$ and the hypothetical parameter
$\hat \theta$. Does the likelihood function depend on the order of
the random variables?
\item {\bf [4 Points]} Suppose that $n = 10$ and the data set contains
six 1s and four 0s. Write a short computer program that plots the
likelihood function of this data for each value of $\hat \theta $ in
$\{0, 0.01, 0.02, \dots, 1.0\}$. For the plot, the $x$-axis should
be $\hat \theta$ and the $y$-axis $L(\hat \theta)$. Scale your
$y$-axis so that you can see some variation in its value. Please
submit both the plot and the code that made it. Please include all
plots for this question in the \code{problem2.pdf} file, as well as
the source code for producing them. That is, do not submit the
source code and plots as separate files.
\item {\bf [4 Points]} Estimate $\mle{\theta}$ by marking on the
$x$-axis the value of $\hat \theta$ that maximizes the
likelihood. Find a closed-form formula for the MLE. Does the closed
form agree with the plot?
\item {\bf [4 Points]} Create three more likelihood plots: one where
$n = 5$ and the data set contains three 1s and two 0s; one where
$n = 100$ and the data set contains sixty 1s and forty 0s; and one
where $n = 10$ and there are five 1s and five 0s.
\item {\bf [4 Points]} Describe how the likelihood functions and
maximum likelihood estimates compare for the different data sets.
\end{enumerate}
\subsection*{Maximum a Posteriori Probability Estimation}
In the maximum likelihood estimate, we treated the true parameter
value $\theta$ as a fixed (non-random) number. In cases where we have
some prior knowledge about $\theta$, it is useful to treat $\theta$
itself as a random variable, and express our prior knowledge in the
form of a prior probability distribution over $\theta$. For example,
suppose that the $X_1$, \dots, $X_n$ are generated in the following
way:
\begin{itemize}
\item First, the value of $\theta$ is drawn from a given prior
probability distribution.
\item Second, $X_1$, \dots, $X_n$ are drawn independently from a
Bernoulli distribution using this value for $\theta$.
\end{itemize}
Since both $\theta$ and the sequence $X_1$, \dots, $X_n$ are random,
they have a joint probability distribution. In this setting, a natural
way to estimate the value of $\theta$ is simply to choose its most
probable value given its prior distribution and the observed data
$X_1$, \dots, $X_n$:
\[
\map{\theta} = \argmax_{\hat \theta} P(\theta = \hat \theta | X_1, \dots, X_n).
\]
This is called the maximum a posteriori probability (MAP) estimate of
$\theta$. Using Bayes' rule, we can rewrite the posterior probability
as follows:
\[
P(\theta = \hat \theta | X_1, \dots, X_n)
=
\frac{P(X_1, \dots, X_n | \theta = \hat \theta) P(\theta = \hat \theta)}
{P(X_1, \dots, X_n)}.
\]
Since the probability in the denominator does not depend on
$\hat \theta$, the MAP estimate is given by
\begin{align*}
\map{\theta} &= \argmax_{\hat \theta}
P(X_1, \dots, X_n | \theta = \hat\theta)
P(\theta = \hat \theta)\\
&= \argmax_{\hat \theta} L(\hat \theta)
P(\theta = \hat\theta).
\end{align*}
In words, the MAP estimate for $\theta$ is the value $\hat \theta$
that maximizes the likelihood function multiplied by the prior
distribution on $\theta$. When the prior on $\theta$ is a continuous
distribution with density function $p$, then the MAP estimate for
$\theta$ is given by
\[
\map{\theta} = \argmax_{\hat \theta} L(\hat \theta) p(\hat \theta).
\]
For this problem, we will use a Beta(3,3) prior distribution for
$\theta$, which has density function given by
\[
p(\hat \theta) = \frac{\hat\theta^2 (1-\hat\theta)^2}{B(3,3)},
\]
where $B(\alpha, \beta)$ is the beta function and
$B(3,3) \approx 0.0333$.
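The stated value of the normalizing constant can be checked numerically via the identity $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$; a minimal sketch using only the Python standard library (purely a sanity check, not required for the assignment):

```python
# Sanity check (not required by the assignment): compute B(3, 3)
# from the gamma-function identity B(a, b) = Gamma(a)Gamma(b)/Gamma(a+b).
from math import gamma

B33 = gamma(3) * gamma(3) / gamma(6)   # = 2 * 2 / 120 = 1/30
assert abs(B33 - 1 / 30) < 1e-12       # approximately 0.0333, as stated
```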
\begin{enumerate}[(a)]
\setcounter{enumi}{5}
\item {\bf [4 Points]} Suppose, as in part (b), that $n = 10$ and we
observed six 1s and four 0s. Write a short computer program that
plots the function
$\hat \theta \mapsto L(\hat \theta)p(\hat \theta)$ for the same
values of $\hat \theta$ as in part (b).
\item {\bf [4 Points]} Estimate $\map{\theta}$ by marking on the
$x$-axis the value of $\hat \theta$ that maximizes the
function. Find a closed-form formula for the MAP estimate. Does the
closed form agree with the plot?
\item {\bf [4 Points]} Compare the MAP estimate to the MLE computed
from the same data in part (c). Briefly explain any significant
difference.
\item {\bf [4 Points]} Comment on the relationship between the MAP and
MLE estimates as $n$ goes to infinity.
\end{enumerate}
\section*{Problem 3: Splitting Heuristic for Decision Trees}
Recall that the ID3 algorithm iteratively grows a decision tree from
the root downwards. On each iteration, the algorithm replaces one leaf
node with an internal node that splits the data based on one decision
attribute (or feature). In particular, the ID3 algorithm chooses the
split that reduces the entropy the most, but there are other
choices. For example, since our goal in the end is to have the lowest
error, why not instead choose the split that reduces error the most?
In this problem we will explore one reason why reducing entropy is a
better criterion.

Consider the following simple setting. Let us suppose each example is
described by $n$ boolean features:
$X = \langle X_1, \ldots, X_n \rangle$, where $X_i \in \{0,1\}$, and
where $n \geq 4$. Furthermore, the target function to be learned is
$f: X \rightarrow Y$, where $Y = X_1 \vee X_2 \vee X_3$. That is,
$Y= 1$ if $X_1=1$ or $X_2 = 1$ or $X_3 = 1$, and $Y = 0$
otherwise. Suppose that your training data contains all of the $2^n$
possible examples, each labeled by $f$. For example, when $n = 4$, the
data set would be
\begin{center}
\begin{tabular}{cccc|c}
$X_1$ & $X_2$ & $X_3$ & $X_4$ & $Y$ \\
\hline
0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 \\
1 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 1 \\
1 & 0 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 & 1 \\
1 & 1 & 1 & 0 & 1 \\
\end{tabular}
\hspace{3cm}
\begin{tabular}{cccc|c}
$X_1$ & $X_2$ & $X_3$ & $X_4$ & $Y$ \\
\hline
0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 1 \\
1 & 1 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 1 \\
1 & 0 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 \\
\end{tabular}
\end{center}
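Since the training set is just the full truth table on $n$ bits labeled by $f$, it can be generated mechanically. A hypothetical Python helper (our own illustration, not part of the assignment):

```python
# Hypothetical helper (not part of the assignment): generate the full
# training set for a given n, i.e. all 2^n boolean vectors, each
# labeled by f(X) = X1 or X2 or X3.
from itertools import product

def make_dataset(n):
    """Return a list of (x, y) pairs covering all 2^n examples."""
    assert n >= 4
    data = []
    for x in product([0, 1], repeat=n):
        y = 1 if (x[0] or x[1] or x[2]) else 0
        data.append((x, y))
    return data

data = make_dataset(4)
assert len(data) == 16                 # all 2^4 examples
assert sum(y for _, y in data) == 14   # matches the two tables above
```

Enumerating the data this way may be useful for checking your answers to the counting questions below by brute force.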
\begin{enumerate}[(a)]
\item {\bf [4 Points]} How many mistakes does the best 1-leaf decision
tree make, over the $2^n$ training examples? (The 1-leaf decision
tree does not split the data even once.)
\item {\bf [4 Points]} Is there a split that reduces the number of
mistakes by at least one? (I.e., is there a decision tree with 1
internal node with fewer mistakes than your answer to part (a)?)
Why or why not?
\item {\bf [4 Points]} What is the entropy of the output label $Y$ for
the 1-leaf decision tree (no splits at all)?
\item {\bf [4 Points]} Is there a split that reduces the entropy of
the output $Y$ by a non-zero amount? If so, what is it, and what is
the resulting conditional entropy of $Y$ given this split?
\end{enumerate}
\end{document}