\documentclass[11pt]{article}
% \usepackage{doublespace}
\usepackage{vmargin}
\usepackage{url}
\usepackage{amssymb,amsmath}
\usepackage{epsfig, graphicx,subfig}
\usepackage{float}
\setpapersize{USletter}
% \setmargnohfrb{30mm}{20mm}{20mm}{20mm}
% \setmargnohfrb{1in}{1in}{1in}{2in}
\setmargrb{1in}{1in}{1in}{1in}%
\markright{03-711 \hfill Computational Genomics an Molecular Biology,
Fall 2021 \hfill} \markright{}
\pagestyle{myheadings}
% \pagestyle{plain}
\begin{document}
\noindent
\section*{Seven11 Assignment 4
{\small \hfill Due 11:59pm, Friday, November 19th}}
\vspace{0.2in}
\textbf{Your name}:
\vspace{0.3in}
{\bf This assignment is required for students taking the 12-unit course only.}
\vspace{2em}
We have been exploring Markov and Hidden Markov models for problems in sequence
analysis. This assignment explores properties of sequences that are not well
suited to Markov models. Some of these models can be modeled by context
sensitive Hidden Markov models (csHMMs), which are described in this paper by
Yoon:
\vspace{0.1in}
% Reading:
\begin{itemize}
\item Yoon, Hidden Markov Models and their Applications in Biological Sequence
Analysis. \textit{Current Genomics} (2009) 10: p402-415.
\url{http://www.cs.cmu.edu/~durand/03-711/2021/Homework/Yoon09.pdf}
\end{itemize}
\vspace{0.5in}
\textbf{NOTE:} This text uses different notation from that used during
lecture or in the course textbooks. In particular, the article uses
$O$ to represent the alphabet, not an observed sequence of symbols.
The first half of the paper provides a good review of HMMs. I
recommend you read it to familiarize yourself with the notation used
throughout the rest of the paper. \vspace{0.1in}
\textit{Homework must be submitted by 11:59pm electronically to Canvas}.
\vspace{0.1in}
Collaboration is allowed on this homework. You must hand in homework
assignments individually. List the names of the people you worked with, and cite all additional references you used to complete this assignment:
\vspace{1in}
\newpage
%%% Prob #1
\begin{enumerate}
\item Palindromes: Restriction enzyme recognition sites are conserved motifs
that are 4, 6 or 8 nucleotides long and are commonly DNA palindromes. A DNA
palindrome is a sequence that is the same when read 5' to 3' on either strand;
that is, the first half of the palindrome is identical to the reverse
complement of the second half of the palindrome, read backwards. For example,
\texttt{C-T-A-G} is a DNA palindrome of length 4; \texttt{C-T-T-C} is not.
\begin{enumerate}
%%% PROB 1 a Palindrome %%
\item Draw a state diagram for a \textbf{\em Markov chain} model that assigns
a non-zero probability to any nucleotide palindrome of length 4 that starts
with an A. All other sequences should have zero probability. Each state in
your model should correspond to a single nucleotide. It is not necessary to
label the edges with transition probabilities.
\vspace{2ex}
\vfill
%%% PROB 1 b Palindrome %%
\item How many states are required for a \textbf{\em Markov chain} that models
all nucleotide palindromes of length 4, starting with {\em any} nucleotide?
\vspace{2ex}
\vspace{0.5in}
\newpage
%%% PROB 1 c Palindrome %%
\item Draw a state diagram for a Markov chain model that assigns a non-zero
probability to any palindromic nucleotide sequence of length 6 starting with
an A.
\vspace{2ex}
\vfill
%%% PROB 1 d Palindrome %%
\item How many states are required for a Markov chain that assigns a
probability to any nucleotide palindrome of length 6, with no restrictions
on the first nucleotide?
\vspace{2ex}
\vspace{0.5in}
%%% PROB 1 e Palindrome %%
\item Are Markov chains a suitable model for palindromes? Why or why not?
\vspace{2ex}
\vspace{1.5in}
\end{enumerate}
\newpage
%%% Prob #2
\item In the context sensitive HMM shown in Figure 4 in Yoon, 2009, the states
$P_1$ and $C_1$ share memory implemented as a stack (First-In-Last-Out). This
model emits sequences of the form $x_1 x_2 \ldots x_N x_N \ldots x_2 x_1$ and
$x_1 x_2 \ldots x_N x_{N+1} x_N \ldots x_2 x_1$.
Suppose you replaced the stack with a queue (First-In-First-out) data
structure. What would the sequences emitted by this modified csHMM look like?
\vspace{2ex}
\vfill
\newpage
%%% Prob #3
\item In addition to restriction sites, near palindromes with complementary base
pairing arise frequently in RNA molecules because they correspond to stem
secondary structures, like the hairpin shown here:
\vspace{-2ex}
\begin{center}
\includegraphics[scale=1]{RNA_hairpin.png}
\end{center}
The csHMM in Figure 4 in Yoon, 2009 can be easily modified to emit DNA
palindromes with complementary base pairing. For example, if the symbol
emitted by $P_1$ at time $i$ is an $A$, then the emission probabilities will
be \vspace{-2ex}
$$
e(x_j| x_i = A, y_i = P_1, y_j = C) =
\begin{cases}
1 \quad\quad x_j = T, \\
0 \quad\quad x_j \in \{A, C, G\}.
\end{cases}
$$
The model in Fig.~4 only emits exact palindromes. If these formed RNA
secondary structures, they would look like this:
\vspace{-4ex}
\begin{center}
\includegraphics[scale=1]{palindromes.png}
\end{center}
However, in RNA molecules, the two halves of the palindrome are typically
separated by several residues, as shown in the hairpin figure at the top of
the page.
Draw a schematic of a modified version of the csHMM in Fig. 4 that emits
sequences of a hairpin with \textit{at least} 3 and \textit{at most} 5
residues between the two halves of the stem. The two halves of the stem may
be of any length greater than zero.
\vspace{2ex}
\vfill
\newpage
%%% Prob #4
\item Sometimes RNA stems are interrupted by a bulge, like this one:
\vspace{0.1in}
\begin{center}
\includegraphics[scale=1]{RNA_bulge.png}
\end{center}
Draw a schematic of a modified variant of the csHMM in Fig. 4 that emits an
RNA stem secondary structure with a bulge on one side. The two halves of the
stem may be of any length greater than zero. The hairpin should be of exactly
length 3, and the bulge should be of length 2, exactly.
\vspace{2ex}
\vfill
\newpage
%%% Prob #5
\item Draw a schematic of a modified variant of the csHMM in Fig. 4 that emits
this secondary structure, called a pseudoknot. The two hairpin sequences
should each be of length {\em at least} one.
\begin{center}
\includegraphics[scale=0.85]{RNA_pseudoknot.png}
\end{center}
\vspace{2ex}
\vfill
\end{enumerate}
\end{document}