Stat341 / CM 361
From Wiki Course Notes
Computational Statistics and Data Analysis is a course offered at the University of Waterloo
Spring 2009
Instructor: Ali Ghodsi
Sampling (Generating random numbers)
Generating Random Numbers - May 12, 2009
Generating random numbers in a computational setting presents challenges. A good way to generate random numbers in computational statistics involves analyzing various distributions using computational methods. As a result, the probability distribution of each possible number appears to be uniform (pseudo-random). Outside a computational setting, presenting a uniform distribution is fairly easy (for example, rolling a fair die repetitively to produce a series of random numbers from 1 to 6).
We begin by considering the simplest case: the uniform distribution.
Multiplicative Congruential Method
One way to generate pseudo random numbers from the uniform distribution is using the Multiplicative Congruential Method. This involves three integer parameters a, b, and m, and a seed variable x0. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:
For example, with a = 13, b = 0, m = 31, x0 = 1, we have:
So,
etc.
The above generator of pseudorandom numbers is called a Mixed Congruential Generator or Linear Congruential Generator, as they involve both an additive and a muliplicative term. For correctly chosen values of a, b, and m, this method will generate a sequence of integers including all integers between 0 and m - 1. Scaling the output by dividing the terms of the resulting sequence by m - 1, we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.
Of course, not all values of a, b, and m will behave in this way, and will not be suitable for use in generating pseudo random numbers.
For example, with a = 3, b = 2, m = 4, x0 = 1, we have:
So,
etc.
For an ideal situation, we want m to be a very large prime number,
for any n, and the period is equal to m-1. In practice, it has been found by a paper published in 1988 by Park and Miller, that a = 75, b = 0, and m = 231 - 1 = 2147483647 (the maximum size of a signed integer in a 32-bit system) are good values for the Multiplicative Congruential Method.
Java's Random class is based on a generator with a = 25214903917, b = 11, and m = 248[1]. The class returns at most 32 leading bits from each xi, so it is possible to get the same value twice in a row, (when x0 = 18698324575379, for instance) without repeating it forever.
General Methods
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.
Inverse Transform Method
This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of a cumulative density function (cdf) of some distribution, the result is a random sample of that distribution. This is shown by this theorem:
Theorem:
If
is a random variable and X = F − 1(U), where F is continuous, monotonic, and is the cumulative density function (cdf) for some distribution, then the distribution of the random variable X is given by F(X).
Proof:
Recall that, if f is the pdf corresponding to F where f is defined as 0 outside of its domain, then,
So F is monotonically increasing, since the probability that X is less than a greater number must be greater than the probability that X is less than a lesser number.
Note also that in the uniform distribution on [0, 1], we have for all a within [0, 1],
.
So,
Completing the proof.
Procedure (Continuous Case)
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:
- Step 1: Draw
.
- Step 2: Compute X = F − 1(U).
Example:
Suppose we want to draw a sample from f(x) = λe − λx where x > 0 (the exponential distribution).
We need to first find F(x) and then its inverse, F − 1.
Now we can generate our random sample
from F(x) by:
The xi are now a random sample from f(x).
This example can be illustrated in Matlab using the code below. Generate ui, calculate xi using the above formula and letting θ = 1, plot the histogram of xi's for i = 1,...,100,000.
u=rand(1,100000); x=-log(1-u)/1; hist(x)
The major problem with this approach is that we have to find F − 1 and for many distributions it is too difficult (or impossible) to find the inverse of F(x). Further, for some distributions it is not even possible to find F(x) (i.e. a closed form expression for the distribution function, or otherwise; even if the closed form expression exists, it's usually difficult to find F − 1).
Procedure (Discrete Case)
The above method can be easily adapted to work on discrete distributions as well.
In general in the discrete case, we have
where:
Thus we can define the following method to find pseudo random numbers in the discrete case (note that the less-than signs from class have been changed to less-than-or-equal-to signs by me, since otherwise the case of U = 1 is missed):
- Step 1: Draw
.
- Step 2:
- If U < p0, return X = x0
- If
, return X = x1
- ...
- In general, if
, return X = xk
Example (from class):
Suppose we have the following discrete distribution:
The cumulative density function (cdf) for this distribution is then:
Then we can generate numbers from this distribution like this, given
from
:
This example can be illustrated in Matlab using the code below:
p=[0.3,0.2,0.5];
for i=1:1000;
u=rand;
if u <= p(1)
x(i)=0;
elseif u < sum(p(1,2))
x(i)=1;
else
x(i)=2;
end
end
Acceptance-Rejection Sampling - May 14, 2009
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the Acceptance/Rejection Method.
Acceptance/Rejection Method
Suppose we wish to sample from a target distribution f(x) that is difficult to sample from directly.
Let g(x) be a distribution that is easy to sample from and satisfies the condition:
, where
Since c*g(x) > f(x) for all x, it is possible to obtain samples that follows f(x) by rejecting a proportion of samples drawn from c*g(x).
This proportion depends on how different f(x) and g(x) are and may vary at different values of x.
That is, if
, we will need to reject more samples drawn at
than at
.
Overall, it can shown that by accepting samples drawn from g(x) with probability
, we can obtain samples that follows f(x)
Consider the example in the graph,
Sampling y = 7 from cg(x) will yield a sample that follows the target distribution f(x) and will y be accepted w/p 1.
Sampling y = 9 from cg(x) will yield a point that is distant from f(x) and will be accepted with a low probability.
Proof
Show that if points are sampled according to the Acceptance/Rejection method then they follow the target distribution.

by Bayes' theorem

by hypothesis.
Then,
Therefore,
as required.
Procedure (Continuous Case)
- Choose g(x) (a density function that is simple to sample from)
- Find a constant c such that :
- Let
- Let
- If
then X=Y; else reject and go to step 1
Example:
Suppose we want to sample from Beta(2,1), for
.
Recall:
- Choose
- Find c: c = 2 (see notes below)
- Let
- Let
- If
, then X=Y; else go to step 1
c was chosen to be 2 in this example since
in this example is 2. This condition is important since it helps us in finding a suitable c to apply the Acceptance/Rejection Method.
In MATLAB, the code that demonstrates the result of this example is:
j = 1;
while i < 1000
y = rand;
u = rand;
if u <= y
x(j) = y;
j = j + 1;
i = i + 1;
end
end
hist(x);
The histogram produced here should follow the target distribution, f(x) = 2x, for the interval
.
The histogram shows the PDF of a Beta(2,1) distribution as expected.
The Discrete Case
The Acceptance/Rejection Method can be extended for discrete target distributions. The difference compared to the continuous case is that the proposal distribution g(x) must also be discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.
Example
Suppose we want to sample from a distribution with the following probability mass function (pmf):
P(X=1) = 0.15 P(X=2) = 0.55 P(X=3) = 0.20 P(X=4) = 0.10
- Choose g(x) to be the discrete uniform distribution on the set {1,2,3,4}
- Find c:
- Generate
- Let
- If
, then X=Y; else go to step 1
Limitations
If the proposed distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size of the target distribution can be established. It may also be difficult to find such g(x) that satisfies all the conditions of the procedure.
We will now begin to discuss sampling from specific distributions.
Special Technique for sampling from Gamma Distribution
Recall that the cdf of the Gamma distribution is:
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.
Additive Property of Gamma Distribution
Recall that if
are independent exponential distributions with mean λ (in other words,
), then
It appears that if we want to sample from the Gamma distribution, we can consider sampling from t independent exponential distributions with mean λ (using the Inverse Method) and add them up. Details will be discussed in the next lecture.
Techniques for Normal and Gamma Sampling - May 19, 2009
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases,
Gamma distributions and Normal distributions, where such practical methods exist.
Gamma Distribution
Given the additive property of the gamma distribution,
If
are independent random variables with
then,
Using this property, we can use the Inverse Transform Method to generate samples from an exponential distribution with appropriate variables, and use these to generate a sample following a Gamma distribution.
- Procedure
- Sample independently from a uniform distribution t times, giving
- Sample independently from an exponential distribution t times, giving
such that,
Using the Inverse Transform Method,
- Using the additive property,
- Sample independently from a uniform distribution t times, giving
This procedure can be illustrated in Matlab using the code below with t = 5,λ = 1 :
U = rand(10000,5); X = sum( -log(U), 2); hist(X)
Normal Distribution
The cdf for the Standard Normal distribution is:
We can see that the normal distribution is difficult to sample from using the general methods seen so far, eg. the inverse is not easy to obtain from F(Z); we may be able to use the Acceptance-Rejection method, but there are still better ways to sample from a Standard Normal Distribution.
Box-Muller Method
[Add a picture WikiSysop 19:25, 1 June 2009 (UTC)]
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.
- Properties of Polar and Cartesian Coordinates
- If x and y are points on the Cartesian plane, r is the length of the radius from a point in the polar plane to the pole, and θ is the angle formed with the polar axis then,
Let X and Y be independent random variables with a standard normal distribution,
also, let r and θ be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,
It can also be shown that the joint distribution of r and θ is given by,
Note that
consists of two density functions, Exponential and Uniform, so assuming that r and θ are independent
- Procedure for using Box-Muller Method
- Sample independently from a uniform distribution twice, giving
- Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,
- Transform r and θ back to x and y,
- Sample independently from a uniform distribution twice, giving
Notice that the Box-Muller Method generates a pair of independent Standard Normal distributions, x and y.
This procedure can be illustrated in Matlab using the code below:
u1 = rand(5000,1);
u2 = rand(5000,1);
d = -2*log(u1);
theta = 2*pi*u2;
x = d.^(1/2).*cos(theta);
y = d.^(1/2).*sin(theta);
figure(1);
subplot(2,1,1);
hist(x);
title('X');
subplot(2,1,2);
hist(y);
title('Y');
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:
subplot(2,1,1);
hist(d);
title('d follows an exponential distribution');
subplot(2,1,2);
hist(theta);
title('theta follows a uniform distribution over [0, 2*pi]');
Useful Properties (Single and Multivariate Normal)
The Box-Muller method as described above samples only from the standard normal distribution. However, both singlevariate and multivariate normal distributions have properties that allow us to use samples generated by the Box-Muller method to sample any normal distribution in general.
- Properties of Normal distributions
These properties can be illustrated through the following example in Matlab using the code below:
Example: For a Multivariate Normal distribution
and
u = [-2; 3]; sigma = [ 1 1/2; 1/2 1]; r = randn(15000,2); ss = chol(sigma); X = ones(15000,1)*u' + r*ss; plot(X(:,1),X(:,2), '.');
Note: In the example above, we generated the square root of Σ using the Cholesky decomposition,
ss = chol(sigma);
which gives
. Matlab also has the sqrtm function:
ss = sqrtm(sigma);
which gives a different matrix,
, but the plot looks about the same (X has the same distribution).
Bayesian and Frequentist Schools of Thought - May 21, 2009
In this lecture we will continue to discuss sampling from specific distributions , introduce Monte Carlo Integration, and also talk about the differences between the Bayesian and Frequentist views on probability, along with references to Bayesian Inference.
Binomial Distribution
A Binomial distribution
is the sum of n independent Bernoulli trials, each with probability of success p
. For each trial we generate an independent uniform random variable:
. Then X is the number of times that
. In this case if n is large enough, by the central limit theorem, the Normal distribution can be used to approximate a Binomial distribution.
Sampling from Binomial distribution in Matlab is done using the following code:
n=3; p=0.5; trials=1000; X=sum((rand(trials,n))'<=p); hist(X)
Where the histogram is a Binomial distribution, and for higher n, it would resemble a Normal distribution.
Monte Carlo Integration
Monte Carlo Integration is a numerical method of approximating the evaluation of integrals by simulation. In this course we will mainly look at three methods for approximating integrals:
- Basic Monte Carlo Integration
- Importance Sampling
- Markov Chain Monte Carlo (MCMC)
Bayesian VS Frequentists
During the history of statistics, two major schools of thought emerged along the way and have been locked in an on-going struggle in trying to determine which one has the correct view on probability. These two schools are known as the Bayesian and Frequentist schools of thought. Both the Bayesians and the Frequentists holds a different philosophical view on what defines probability. Below are some fundamental differences between the Bayesian and Frequentist schools of thought:
Frequentist
- Probability is objective and refers to the limit of an event's relative frequency in a large number of trials. For example, a coin with a 50% probability of heads will turn up heads 50% of the time.
- Parameters are all fixed and unknown constants.
- Any statistical process only has interpretations based on limited frequencies. For example, a 95% C.I. of a given parameter will contain the true value of the parameter 95% of the time.
Bayesian
- Probability is subjective and can be applied to single events based on degree of confidence or beliefs. For example, Bayesian can refer to tomorrow's weather as having 50% of rain, whereas this would not make sense to a Frequentist because tomorrow is just one unique event, and cannot be referred to as a relative frequency in a large number of trials.
- Parameters are random variables that has a given distribution, and other probability statements can be made about them.
- Probability has a distribution over the parameters, and point estimates are usually done by either taking the mode or the mean of the distribution.
Bayesian Inference
Example:
If we have a screen that only displays single digits from 0 to 9, and this screen is split into a 4x5 matrix of pixels, then all together the 20 pixels that make up the screen can be referred to as
, which is our data, and the parameter of the data for this case, which we will refer to as θ, would be a discrete random variable that can take on the values of 0 to 9. In this example, a Bayesian would be interested in finding
, whereas a Frequentist would be more interested in finding
Bayes' Rule
Note: In this case f(θ | X) is referred to as posterior, f(X | θ) as likelihood, f(θ) as prior, and f(X) as the marginal, where θ is the parameter and X is the observed variable.
Procedure in Bayesian Inference
- First choose a probability distribution as the prior, which represents our beliefs about the parameters.
- Then choose a probability distribution for the likelihood, which represents our beliefs about the data.
- Lastly compute the posterior, which represents an update of our beliefs about the parameters after having observed the data.
As mentioned before, for a Bayesian, finding point estimates usually involves finding the mode or the mean of the parameter's distribution.
Methods
- Mode:
value of θ that maximizes f(θ | X)
- Mean:
If it is the case that θ is high-dimensional, and we are only interested in one of the components of θ, for example, we want θ1 from
, then we would have to calculate the integral:
This sort of calculation is usually very difficult or not feasible to compute, and thus we would need to do it by simulation.
Note:
is not a function of θ, and is called the Normalization Factor
- Therefore, since f(x) is like a constant, the posterior is proportional to the likelihood times the prior:
Monte Carlo Integration - May 26, 2009
Today's lecture completes the discussion on the Frequentists and Bayesian schools of thought and introduces Basic Monte Carlo Integration.
Frequentist vs Bayesian Example - Estimating Parameters
Estimating parameters of a univariate Gaussian:
Frequentist: 
Bayesian:
Frequentist Approach
Let XN denote (x1,x2,...,xn). Using the Maximum Likelihood Estimation approach for estimating parameters we get:
Setting
we get
Bayesian Approach
Assuming the prior is Gaussian:
By completing the square we conclude that the posterior is Gaussian as well:
Where
The expectation from the posterior is different from the MLE method.
Note that
. Also note that when N = 0 we get
.
Basic Monte Carlo Integration
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term Monte Carlo is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. Stochastic simulation refers to a random process in which its future evolution is described by probability distributions (counterpart to a deterministic process), and these simulation methods are known as Monte Carlo methods. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [1]
We start with a simple example:
where
the p.d.f. is Unif(a,b)
If
then by the Law of Large Numbers
Process
Given
From this we can compute other statistics, such as
-
where
with
-
CI's can be estimated as
Example 1
Find
for
- We need to draw from
-
This example can be illustrated in Matlab using the code below:
u=rand(100,1) x=-log(u) h= x.^ .5 mean(h) %The value obtained using the Monte Carlo method F = @ (x) sqrt (x). * exp(-x) quad(F,0,50) %The value of the real function using Matlab
Example 2
Find
This example can be illustrated in Matlab using the code below:
x = rand (1000) mean(x^3)
Example 3
To estimate an infinite integral
such as
- Substitute in
-
-
-
-
Importance Sampling and Monte Carlo Simulation - May 28, 2009
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and begun talking about Importance Sampling.
Binomial Probability Monte Carlo Simulations
Example 1:
You are given two independent Binomial distributions with probabilities
. Using a Monte Carlo simulation, approximate the value of
, where
.
;
; 
So
where
is a flat distribution and the expected value of
is as follows:
Since X, Y are independent, we can split the conditional probability distribution:
We need to find conditional distribution functions for
to draw samples from. In order to get a distribution for the probability 'p' of a Binomial, we have to divide the Binomial distribution by n. This new distribution has the same shape as the original, but is scaled. A Beta distribution is a suitable approximation. Let
and
, where

Process:
- Draw samples for
and
:
,
, ...,
;
- Compute
in order to get n values for
;
-
.
Matlab Code:
- The Matlab code for recreating the above example is as follows:
n=100; %number of trials for X m=100; %number of trials for Y x=80; %number of successes for X trials y=60; %number of successes for y trials p1=betarnd(x+1, n-x+1, 1, 1000); p2=betarnd(y+1, m-y+1, 1, 1000); delta=p1-p2; mean(delta);
The mean in this example is given by 0.1938.
A 95% confidence interval for δ is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:
q1=quantile(delta,0.025); q2=quantile(delta,0.975);
The interval is approximately
Note: In this case, we can also find E(δ) analytically since
. Compare this with the maximum likelihood estimate for δ:
.
Example 2:
Bayesian Inference for Dose Response
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:
For each dose
we test n rats and observe
, the number of rats that die. Therefore,

.
We can assume that the probability of death grows with the concentration of drug given, i.e.
. Estimate the dose at which the animals have at least 50% chance of dying.
- Let
where
- We are interested in
since any higher concentrations are known to have a higher death rate.
Solving this analytically is difficult:
where g is an unknown function

- where

- where
Process: Monte Carlo
We assume that
- Draw
- Keep sample only if it satisfies
, otherwise discard and try again.
- Compute
by finding the first
sample with over 50% deaths.
- Repeat process n times to get n estimates for
.
-
.
For instance, for each dose level Xi, for 1 < = i < = 10, 10 rats are used and it is observed that the numbers that are dying is Yi, where Y1 = 4,Y2 = 3,etc.
Importance Sampling
In statistics, Importance Sampling helps estimating the properties of a particular distribution. As in the case with the Acceptance/Rejection method, we choose a good distribution from which to simulate the given random variables. The main difference in importance sampling however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. As shown in the figure, the uniform distribution
is a proposal distribution to sample from and g(x) is the target distribution.
Here we cast the integral
, as the expectation of g(x) with respect to U such that,
.
Hence we can approximate this integral by,
.
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]
In
, Monte Carlo simulation can be used only if it easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability distribution function g(x) which is easy to sample from, then
can be written as
-

the expectation of
with respect to g(x) and therefore
-
-
Process
- Choose
such that it's easy to sample from and draw
- Compute
-

Note: By the law of large numbers, we can say that
converges in probability to I.
"Weighted" average
- The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from
that are closer to the original distribution
, which we would ideally like to sample from (but cannot because it is too difficult).

which is like saying that we are applying a regular Monte Carlo Simulation method to
, and taking each result from this process and weighting the more accurate ones (i.e. the ones for which
is high) higher.
One can view
as a weight.
Then 
i.e. we are computing a weighted sum of h(xi) instead of a sum
A Deeper Look into Importance Sampling - June 2, 2009
From last class, we have determined that an integral can be written in the form
We continue our discussion of Importance Sampling here.
Importance Sampling
We can see that the integral
is just
the expectation of h(x) with respect to g(x), where
is a weight
. In the case where
, a greater weight for
will be assigned. Thus, the points with more weight are deemed more important, hence "importance sampling". This can be seen as a variance reduction technique.
Problem
The method of Importance Sampling is simple but can lead to some problems. The
estimated by Importance Sampling could have infinite standard error.
Given
where
.
Obtaining the second moment,
We can see that if
, then
. This occurs if
has a thinner tail than
then
could be infinitely large. The general idea here is that
should not be large.
Remark 1
It is evident that
should be chosen such that it has a thicker tail than
.
If
is large over set
but
is small, then
would be large and it would result in a large variance.
Remark 2
It is useful if we can choose
to be similar to
in terms of shape. Ideally, the optimal
should be similar to
, and have a thicker tail. It's important to take the absolute value of
, since a variance can't be negative. Analytically, we can show that the best
is the one that would result in a variance that is minimized.
Remark 3
Choose
such that it is similar to
in terms of shape. That is, we want
Theorem (Minimum Variance Choice of
)
The choice of
that minimizes variance of
is
.
Proof:
We know that
The variance of
is
As we can see, the second term does not depend on
. Therefore to minimize
we only need to minimize the first term. In doing so we will use Jensen's Inequality.
If
is a convex function ( twice differentiable and
) then 
Essentially the definition of convexity implies that the line segment between two points on a curve lies above the curve, which can then be generalized to higher dimensions:
where
- =======================================================
Proof (cont)
Using Jensen's Inequality,
as
Therefore
and

since
and
are density functions,
are non negative.
Thus, this is a lower bound on
. If we replace
into
, we can see that the result is as we require. Details omitted.
However, this is mostly of theoritical interest. In practice, it is impossible or very difficult to compute
.
Note: Jensen's inequality is actually unnecessary here. We just use it to get
, which could be derived using variance properties:
.
Importance Sampling and Markov Chain Monte Carlo (MCMC) - June 4, 2009
Remark 4:
where 
Apply the idea of importance sampling to both the numerator and denominator:
-
-
-
where
-
The above results in the following form of Importance Sampling:
where
This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in the Bayesian approach when f is a posterior density function.
Example of Importance Sampling
Estimate
when 

- Where
Approach I: Monte Carlo
where
The idea here is to sample from normal distribution and to count number of observations that is greater than 3.
The variability will be high in this case if using Monte Carlo since this is considered a low probability event (a tail event), and different runs may give significantly different values. For example: the first run may give only 3 occurences (i.e if we generate 1000 samples, thus the probability will be .003), the second run may give 5 occurences (probability .005), etc.
This example can be illustrated in Matlab using the code below (we will be generating 100 samples in this case):
format long for i = 1:100 a(i) = sum(randn(100,1)>=3)/100; end meanMC = mean(a) varMC = var(a)
On running this, we get meanMC = 0.0005 and
Approach II: Importance Sampling
where f(x) is standard normal and g(x) needs to be chosen wisely so that it is similar to the target distribution.
- Let
This example can be illustrated in Matlab using the code below:
for j = 1:100
N = 100;
x = randn (N,1) + 4;
for ii = 1:N
h = x(ii)>=3;
b = exp(8-4*x(ii));
w(ii) = h*b;
end
I(j) = sum(w)/N;
end
MEAN = mean(I)
VAR = var(I)
Running the above code gave us
and
which is very close to 0, and is much less than the variability observed when using Monte Carlo
Markov Chain Monte Carlo (MCMC)
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain.
Consider the same integral:
Idea: If
is a Markov Chain with stationary distribution f(x), then under some conditions
Stochastic Process:
A Stochastic Process is a collection of random variables
- State Space Set:
is the set that random variables
takes values from.
- Indexed Set:
is the set that t takes values from, which could be discrete or continuous in general, but we are only interested in discrete case in this course.
Example 1
i.i.d random variables
State Space
Indexed Set
Example 2
: price of a stock
: opening date of the market
Basic Fact: In general, if we have random variables
where
Markov Chain:
A Markov Chain is a special form of stochastic process in which
depends only on
.
For example,
Transition Probability:
The probability of going from one state to another state.
Transition Matrix:
For n states, transition matrix P is an
matrix with entries
as below:
Example:
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is
. And the probability of moving to the left is
. Once the first or the last state is reached, then we stop. The transition matrix that express this process is shown as below:
Markov Chain Definitions - June 9, 2009
Practical application for estimation:
The general concept for the application of this lies within having a set of generated xi which approach a distribution f(x) so that a variation of importance estimation can be used to estimate an integral in the form
by 
All that is required is a Markov chain which eventually converges to f(x).
In the previous example, the entries pij in the transition matrix P represent the probability of reaching state j from state i after one step. For this reason, the sum over all entries in a particular row sum to 1, as this itself must be a pmf if a transition from i is to lead to a state still within the state set for Xt.
Homogeneous Markov Chain
The probability matrix P is the same for all indicies
.
If we denote the pmf of Xn by a probability vector
where i denotes an ordered index of all possible states of X.
Then we have a definition for the
marginal probabilty 
where we simplify Xn to represent the ordered index of a state rather than the state itself.
From this definition it can be shown that,
Proof:
and since
Therefore,
With this, it is possible to define P(n) as an n-step transition matrix where 
Theorem: 
Proof:
From the previous conclusion
And since this is a homogeneous Markov chain, P does not depend on i so
From this it becomes easy to define the n-step transition matrix as
Summary of definitions
- transition matrix is an NxN when N = | X | matrix with Pij = Pr(X1 = j | X0 = i) where

- n-step transition matrix also NxN with Pij(n) = Pr(Xn = j | X0 = i)
- marginal (probability of X)μn(i) = Pr(Xn = i)
- Theorem: Pn = Pn
- Theorem: μn = μ0Pn
---
Definitions of different types of state sets
Define
State Space
If Pij(n) > 0 for some n , then we say i reaches j denoted by
This also mean j is accessible by i:
If
and
then we say i and j communicate, denoted by
Theorems
1) 
2) 
3) If 
4) The set of states of X can be written as a unique disjoint union of subsets (equivalence classes)
where two states i and j communicate IFF they belong to the same subset
More Definitions
A set is Irreducible if all states communicate with each other (has only one equivalence class).
A subset of states is Closed if once you enter it, you can never leave.
A subset of states is Open if once you leave it, you can never return.
An Absorbing Set is a closed set with only 1 element (i.e. consists of a single state).
Note
- We cannot have
with i recurrent and j transient since
.
- All states in an open class are transient.
- A Markov Chain with a finite number of states must have at least 1 recurrent state.
- A closed class with an infinite number of states has all transient or all recurrent states.
Again on Markov Chain - June 11, 2009
Decomposition of Markov chain
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.
Example:
Let
and the transition matrix be:
The decomposition in classes is:
- class 1:
From the matrix we see that the states 1 and 2 have only
and
as nonzero transition probability
- class 2:
The state 3 can go to every other state but none of the others can go to it
- class 3:
This is an absorbing state since it is a close class and there is only one element
- class 1:
Recurrent and Transient states
A state i is called
or
if
for some
That means that the probability to come back to the state i, starting from the state i, is 1.
If it is not the case (ie. probability less than 1), then state i is
.
It is straight forward to prove that a finite irreducible chain is recurrent.
Theorem
Given a Markov chain,
A state
is
if and only if
A state
is
if and only if
Properties
- If
is
and
then
is
- If
is
and
then
is
- In an equivalence class, either all states are recurrent or all states are transient
- A finite Markov chain should have at least one recurrent state
- The states of a finite, irreducible Markov chain are all recurrent (proved using the previous preposition and the fact that there is only one class in this kind of chain)
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.
Example
Let
and suppose that
,
and
otherwise.
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.
We now see other properties of this particular Markov chain:
- Since all states communicate if one of them is recurrent, then all states will be recurrent. On the other hand, if one of them is transient, then all the other will be transient too.
- Consider now the case in which we are in state
. If we move of n steps to the right or to the left, the only way to go back to
is to have n steps on the opposite direction.
We now want to know if this event is transient or recurrent or, equivalently, whether
or not.
To proceed with the analysis, we use the
:
The probability could therefore be approximated by:
And the formula becomes:
We can conclude that if
then the state is transient, otherwise is recurrent.
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:
Then substitute in x = pq.
So we reach the same conclusion: all states are recurrent iff
.
Convergence of Markov chain
We define the

as the minimum time to go from the state i to the state j. It is also possible that the state j is not reachable from the state i.
The mean of the recurrent time
is defined as:
where
Using the objects we just introduced, we say that:
Lemma
In a finite state Markov Chain, all the recurrent state are positive
Periodic and aperiodic Markov chain
A Markov chain is called
of period
if, starting from a state, we will return to it every we can only return to that state in a multiple of n steps.
steps with probability 
Example
Considerate the three-state chain:
It's evident that, starting from the state 1, we will return to it on every 3rd step and so it works for the other two states. The chain is therefore periodic with perdiod d = 3
An irreducible Markov chain is called
if:
Another Example
Consider the chain
This chain is periodic by definition. You can only get back to state 1 after at least 2 steps
period d = 2
Markov Chains and their Stationary Distributions - June 16, 2009
New Definition:Ergodic
A state is Ergodic if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.
Define a vector π where
and
| ∑ | πi = 1 |
| i |
(ie. π is a pmf)
π is a stationary distribution if π = πP where P is a transition matrix.
Limiting Distribution
If as
then π is the limiting distribution of the Markov Chain represented by P.
Theorem: An irreducible, ergodic Markov Chain has a unique stationary distribution π and there exists a limiting distribution which is also π.
Detailed Balance
The condition for detailed balanced is
Theorem
If π satisfies detailed balance then it is a stationary distribution.
Proof.
We need to show π = πP
as required
Warning! A chain that has a stationary distribution does not necessarily converge.
For example,
has a stationary distribution
but it will not converge.
Stationary Distribution
π is stationary (or invariant) distribution if π = π * p [0.5 0 0.5] Half of time their chain will spend half time in 1st state and half time in 3rd state.
Theorem
An irreducible ergodic Markov Chain has a unique stationary distribution π. The limiting distribution exists and is equal to π.
If g is any bounded function, then with probability 1:
Example
Find the limiting distribution of
Solve π = πP



Also 
We can solve the above system of equations and obtain


Thus,
and we get
Subbing
back into the system of equations we obtain
and
Therefore the limiting distribution is
Monte Carlo using Markov Chain - June 18, 2009
Consider the problem of computing
Generate
,
,... from a Markov Chain with stationary distribution
Metropolis Hastings Algorithm
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.
Suppose we wish to sample from f(x), the 'target' distribution. Choose q(y | x), the 'proposal' distribution that is easily sampled from.
1. Initialize
, this is the starting point of the chain, choose it randomly and set index
2.
3. Compute
, where
4.
5. If
- then
- then
- else
6. Update index
, and go to step 2
A couple of remarks about the algorithm
Remark 1: A good choice for
is
where
is a constant. The starting point of the algorithm X0 = x, i.e. the proposal distibution is a normal centered at the current, randomly chosen, state.
Remark 2: If the proposal distribution is symmetric,
, then
. This is called the Metropolis Algorithm, which is a special case of the original algorithm of Metropolis (1953).
Remark 3:
is symmetric. Probability of setting mean to x and sampling y is equal to the probability of setting mean to y and samplig x.
Example: The Cauchy distribution has density
Let the proposal distribution be
since
is symmetric
Now, having calculated
, we complete the problem in Matlab using the following code:
b=2; % let b=2 for now, we will see what happens when b is smaller or larger
X(1)=randn;
for i=2:10000
Y=b*randn+X(i-1); % we want to decide whether we accept this Y
r=min( (1+X(i-1)^2)/(1+Y^2),1);
u=rand;
if u<r
X(i)=Y; % accept Y
else
X(i)=X(i-1); % reject Y remaining in the current state
end;
end;
We need to be careful about choosing b!
- If b is too large
- Then the fraction
would be very small
is very small aswell.
- Then the fraction
- It is highly unlikely that
, the probability of rejecting
is high so the chain is likely to get stuck in the same state for a long time
chain may not coverge to the right distribution.
- It is highly unlikely that
- It is easy to observe by looking at the histogram of
, the shape will not resemble the shape of the target
- It is easy to observe by looking at the histogram of
- Most likely we reject y and the chain will get stuck.
- If b is too small
- Then we are setting up our proposal distribution
to be much narrower then than the target
so the chain will not have a chance to explore the sample state space and visit majority of the states of the target
.
- Then we are setting up our proposal distribution
- If b is just right
- Well chosen b will help avoid the issues mentioned above and we can say that the chain is "mixing well".
Mathematical explanation for why this algorithm works:
We talked about
MC so far.
We have seen that:
-
satisfies detailed balance if
and
- if
satisfies
then it is a stationary distribution
In
case we write the Detailed Balance as
and say that
is
if
.
We want to show that if Detailed Balance holds (i.e. assume
) then
is stationary distribution.
That is to show:
is stationary distribution.
and since
Now, we need to show that detailed balance holds in the Metropolis-Hastings...
Consider 2 points
and
:
- Either
OR
(ignoring that it might equal to 1)
Without loss of generality. suppose that the product is
.
In this case
and
- Some intuitive meanings before we continue:
is jumping from
to
if proposal distribution generates
and
is accepted
is jumping from
to
if proposal distribution generates
and
is accepted
is the probability of generating
is the probability of generating
probability of accepting
probability of accepting
.
With that in mind we can show that
as follows:
Cancelling out
and bringing
to the other side we get
equation
Multiplying both sides by
we get
equation
Noticing that the right hand sides of the equation
and equation
are equal we conclude that:
as desired and thus showing that Metropolis-Hastings satisfies detailed balance.
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.
Metropolis Hastings Algorithm Continued - June 25
Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to help sample from probability distributions that are difficult to sample from. The algorithm was named after Nicholas Metropolis (1915-1999), also co-author of the Simulated Anealing method (that is introduced in this lecture as well). The Gibbs sampling algorithm, that will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. This is a more efficient method, although less applicable at times.
In the last class, we showed that Metropolis Hastings satisfied the the detail-balance equations. i.e.
, which means
is the stationary distribution of the chain.
But this is not enough, we want the chain to converge to the stationary distribution as well.
Thus, we also need it to be:
Irreducible: There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choice of
including the one that we used in the example in the previous lecture (which was normally distributed)
Aperiodic: The chain will not oscillate between different set of states. In the previous example,
is
, which will clearly not oscillate.
Next we discuss a couple of variations of Metropolis Hastings
Random Walk Metropolis Hastings
1. Draw
, where
has distribution
;
;
is current state &
is going to be close to
2. It means
. (Note that
is a function of distance between the current state and the state the chain is going to travel to, i.e. it's of the form
. Hence we know in this version that
is symmetric
)
3.
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was
In matlab, we defined this as
(i.e
In this case, we need
The hard problem is to choose b so that the chain will mix well.
Rule of thumb: choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)
Example
If we draw
then
, accept
with probability 1
If we draw
then
, accept
with probability
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.
Independence Metropolis Hastings
In this case, the proposal distribution is independent of the current state, i.e.
We draw from a fixed distribution
And define
And, this does not work unless
is very similar to the target distribution
(i.e. usually used when
is known up to a proportionality constant - the form of the distibution is known, but the distribution is not exactly known)
Now, we pose the question: if
does not depend on
, does it mean that the sequence generated from this chain is really independent?
Answer: Even though
does not depend on
, but
depends on
. So it's not really an independent sequence!
Thus, the sequence is not really independent because acceptance probability
depends on the state
Simulated Annealing
Simulated Annealing is a method of optimization and an application of the Metropolis Hastings Algorithm.
Consider the problem where we want to find x such that the objective function h(x) is at it's minimum,

Given a constant T and since the exponential function is monotone, this optimization problem is equivalent to,
We consider a distribution function
such that
, where T is called the temperature, and define the following procedure:
- Set T to be a large number
- Start with some random
(note that
is usually chosen to be symmetric)
- Define
(when
is symmetric)

IE. Xi + 1 = Y if U < r
- Decrease T and go to Step 2
Now, we know that
Consider
Now, suppose T is large,
if
and we therefore accept
with probability 1
if
and we therefore accept
with probability
On the other hand, suppose T is arbitrarily small (
),
if
and we therefore accept
with probability 1
if
and we therefore reject
Example 1
Consider the problem of minimizing the function
We can plot this function and observe that it makes a local minimum near
We then plot the graphs of
for
and observe that the distribution expands for a large T, and contracts for T small - i.e. T plays the role of variance - making the distribution expand and contract accordingly.
In the end, T is small and the region we are trying to sample from becomes sharper. The points that we accept are increasingly close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).
In practice the algorithm may get 'stuck' in another local minimum nearby for T too small and we don't get the convergence we looking for.
Example 2 (from June 30th lecture)
Suppose we want to minimize the function
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use
and we begin with
Then the problem may be solved in MATLAB using the following:
function v = obj(x) v = (x - 2).^2;
T = 10; %this is the initial value of T, which we must gradually decrease
b = 2;
X(1) = 0;
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)
Y = b*randn + X(i-1);
r = min(1 , exp(-obj(Y)/T)/exp(-obj(X(i-1))/T) );
U = rand;
if U < r
X(i) = Y; %accept Y
else
X(i) = X(i-1); %reject Y
end;
end;
The first run (with
) gives us
Next, if we let
in the order displayed, then we get the following graph when we plot X:
i.e. it converges to the minimum of the function
Travelling Salesman Problem
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.[2]
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled Introduction to TSP - Infrastructure for the Travelling Salesman Problem. [2]
The examples are particularly useful because they are implemented using R (a statistical computing software environment).
Gibbs Sampling - June 30, 2009
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).[3] In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies detailed balance equation, similar to Metropolis-Hastings
This implies that the chain is irreversible. The procedure of proving this balance equation is similar to what was done with Metropolis-Hasting proof.
Advantages
- The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.
- It is useful for high-dimensional distributions.
Disadvantages[4]
- We rarely know how to sample from the conditional distributions.
- The algorithm can be extremely slow to converge.
- It is often difficult to know when convergence has occurred.
- The method is not practical when there are relatively small correlations between the random variables.
Example: Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e.
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing
and the resulting set of points drawn
follows the required multivariate distribution.
Suppose we want to sample from a bivariate distribution
with initial point
, i = 0
Furthermore, suppose that we know how to sample from the conditional distributions
and
-
(i.e. given the previous point, sample a new point)
-
(note: it must be
not Yi, otherwise detailed balance may not hold)
- Repeat Steps 1 and 2
Note This method have usually a long time before convergence called "burning time". For this reason the distribution will be sampled better using only some of the last
rather than all of them.
Example
Suppose we want to generate samples from a bivariate normal distribution where
and
Note that for a bivariate distribution it may be shown that the conditional distributions are normal. So,
and
The problem (for a specified value
) may be solved in MATLAB using the following:
Y = [1 ; 2]; rho = 0.01; sigma = sqrt(1 - rho^2); X(1,:) = [0 0];
for i = 2:5000
mu = Y(1) + rho*(X(i-1,2) - Y(2));
X(i,1) = mu + sigma*randn;
mu = Y(2) + rho*(X(i-1,1) - Y(1));
X(i,2) = mu + sigma*randn;
end;
%plot(X(:,1),X(:,2),'.') plots all of the points
%plot(X(1000:end,1),X(1000:end,2),'.') plots the last 4000 points ->
this demonstrates that convergence occurs after a while
(this is called the burning time)
The output of plotting all points is:
Metropolis-Hastings within Gibbs Sampling - July 2
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.
- Choose
as a proposal distribution for X (assuming Y fixed).
- Choose
as a proposal distribution for Y (assuming X fixed).
- Do a Metropolis-Hastings step for X, treating Y as fixed.
- Do a Metropolis-Hastings step for Y, treating X as fixed.
- Start with some random variables
- Draw
- Set
-
- Draw
- Set
-
- Set
, return to step 2 and repeat the same procedure
- Start with some random variables
Page Ranking and Review of Linear Algebra - July 7
Page Ranking
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page.
We rank pages based on the weighted number of links to that particular page. A web page is important if so many pages point to it.
Factors relating to importance of links
1) Importance (rank) of linking web page (higher importance is better).
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).
Definitions
Under this formula,
is never zero. We weight the sum and the constant using
(which is just a coefficient between 0 and 1 used to balance the objective function).
In Matrix Form
where
are both
X
matrices
Solving for P
Since rank is a relative term, if we make an assumption that
then we can solve for P (in matrix form this is
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.
To solve for P, we start with a random guess
and repeatedly apply
Since this is a stationary series, for large n
.
Exemple of utilisation of Page Ranking :
Suppose that the links between four webpages are:
1 → 2
↕ ↙
3 ← 4
According to the factors relating to importance of links, we can consider two possible rankings :
or
if we consider that the high importance of the link from 3 to 1 is more influent than the fact that there are two outgoing links from page 1 and only one from page 2.
To rank our pages , we use the formula :
Where
In this example,
1)
2)
,
So that here,
3) We have D=diag(C), so:
Algorithm on matlab:
L=[0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0]
C=[2 1 1 1]
D=diag(C)
d=0.85
A=(1-d)*ones(4)/4+d*L*inv(D)
[vec val]=eig(A)
We are now going to choose a random vector p:
p=rand(1,4)
p=p/sum(p)
p=p'
p=A*p
p=A*p
etc...
p=A*p
p =
0.3725 0.1958 0.3942 0.0375
So we conclude that
as the probability of being on page three > Pr (page 1) > Pr (page 2) > Pr (page 4)
Linear Algebra Review
Inner Product[5]
Note that the inner product is also referred to as the dot product.
If
then the inner product is :
The length (or norm) of
is the non-negative scalar
defined by
For
and
in
, the distance between
and
written as
, is the length of the vector
. That is,
If
and
are non-zero vectors in
or
, then the angle between
and
is given as
Orthogonal and Orthonormal[6]
Two vectors
and
in
are orthogonal (to each other) if
By the Pythagorean Theorem, it may also be said that two vectors
and
are orthogonal if and only if
Two vectors
and
in
are orthonormal if
and
An orthonormal matrix
is a square invertible matrix, such that
or alternatively
Note that an orthogonal matrix is an orthonormal matrix.
Dependence and Independence[7]
The set of vectors
in
is said to be linearly independent if the vector equation
has only the trivial solution (i.e.
).
The set of vectors
in
is said to be linearly dependent if there exists a set of coefficients
(not all zero), such that
.
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent.
That is, any vector set
in
is linearly dependent if p > n.
If a vector set
in
contains the zero vector, then the set is linearly dependent.
Trace and Rank[8]
The trace of a square matrix
, denoted by
, is the sum of the diagonal entries in
. That is,
Note that an alternate definition for the trace is:
i.e. it is the sum of all the eigenvalues of the matrix
The rank of a matrix
, denoted by
, is the dimension of the column space of A. That is, the rank of a matrix is number of linearly independent rows (or columns) of A.
A square matrix is non-singular if and only if its rank equals the number of rows (or columns). Alternatively, a matrix is non-singular if it is invertible (i.e. its determinant is NOT zero).
A matrix that is not invertible is sometimes called a singular matrix.
A matrix is said to be non-singular if and only if its rank equals the number of rows or columns. A non-singular matrix has a non-zero determinant.
A square matrix is said to be orthogonal if AAT = ATA = I.
For a square matrix A,
- if
,then A is said to be positive-definite.
- if
,then A is said to be positive-semidefinite.
The inverse of a square matrix A is denoted by A − 1 and is such that AA − 1 = A − 1A = I. The inverse of a matrix A exists if and only if A is non-singular.
The pseudo-inverse matrix
is typically used whenever A − 1 does not exist because A is either not square or singular:
with
.
Vector Spaces
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.
A set of vectors {u1,u2,u3,...un} is said to form a basis for a vector space if any arbitrary vector x can be represented by a linear combination of the {ui}: x = a1u1 + a2u2 + ... + anun
- The coefficients {a1,a2,...an} are called the components of vector x with the basis {ui}.
- In order to form a basis, it is necessary and sufficient that the {ui} vectors be linearly independent.
A basis {ui} is said to be orthogonal if
A basis {ui} is said to be orthonormal if
Eigenvectors and Eigenvalues
Given matrix ANxN, we say that v is an 'eigenvector if there exists a scalar λ (the eigenvalue) such that Av = λv where λ is the corresponding eigenvalue.
Computation of eigenvalues
Characteristic Equation
Properties
- If A is non-singular all eigenvalues are non-zero.
- If A is real and symmetric, all eigenvalues are real and the associated eigenvectors are orthogonal.
- If A is positive-definite all eigenvalues are positive
Linear Transformations
A linear transformation is a mapping from a vector space XN onto a vector space YM, and it is represented by a matrix
- Given vector
, the corresponding vector y on YM is computed as y = Ax.
- The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).
A linear transformation represented by a square matrix A is said to be orthonormal when AAT = ATA = I
- implies that AT = A − 1
- An orthonormal transformation has the property of preserving the magnitude of the vectors:
- An orthonormal matrix can be thought of as a rotation of the reference frame
- The row vectors of an orthonormal transformation form a set of orthonormal basis vectors.
Interpretation of Eigenvalues and Eigenvectors
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.
- When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.
Given the covariance matrix
of a Gaussian distribution
- The eigenvectors of
are the principal directions of the distribution
- The eigenvalues are the variances of the corresponding principal directions
The linear transformation defined by the eigenvectors of
leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in Principal Component Analysis).
- If the distribution is Gaussian, then the transformed vectors will be statistically independent.
Principal Component Analysis - July 9
Principal Component Analysis (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.
Rough definition
Keepings two important aspects of data analysis in mind:
- Reducing covariance in data
- Preserving information stored in data(Variance is a source of information)
PCA takes a sample of d - dimensional vectors and produces an orthogonal(zero covariance) set of d 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first k Principle Components and approximating the data in k - dimensional space, which is easier to analyze and plot.
Principal Components of handwritten digits
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset.
We can represent each image as a vector of length 644 (
) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call
and
. If we expand the matrix product then each image is approximated by a linear combination of the columns of V:
, where λ = [λ1,λ2]T is a column of W.
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.
More examples
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see Nonlinear dimensionality reduction).
- Handwritten digits.
- Recognition of hand orientation. (Isomap??)
- Recognition of facial expressions. (LLE - Locally Linear Embedding?)
- Arranging words based on semantic meaning.
- Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)
Derivation of the first Principle Component
We want to find the direction of maximum variation. Let
be an arbitrary direction,
a data point and
the length of the projection of
in direction
.
The direction
is the same as
so without loss of generality,
Let
be random variables, then our goal is to maximize the variance of u,
For a finite data set we replace the covariance matrix Σ by s, the sample covariance matrix,
is the variance of any vector
, formed by the weight vector
. The first principal component is the vector that maximizes the variance,
where arg max denotes the value of w that maximizes the function. Our goal is to find the weights
that maximize this variability, subject to a constraint.
The problem then becomes,
such that
Notice,
Therefore the variance is bounded, so the maximum exists. We find the this maximum using the method of Lagrange multipliers.
Principal Component Analysis Continued - July 14
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes:
with constraint
.
Before we can proceed, we must review Lagrange Multipliers.
Lagrange Multiplier
To find the maximum (or minimum) of a function
subject to constraints
, we define a new variable
called a Lagrange Multiplier and we form the Lagrangian,
If
is the max of
, there exists
such that
is a stationary point of
(partial derivatives are 0).
In addition
is a point in which functions
and
touch but do not cross. At this point, the tangents of
and
are parallel or gradients of
and
are parallel, such that:
where,
the gradient of
the gradient of
Example
Suppose we wish to maximise the function
subject to the constraint
. We can apply the Lagrange multiplier method on this example; the lagrangian is:
We want the partial derivatives equal to zero:
Solving the system we obtain 2 stationary points:
and
. In order to understand which one is the maximum, we just need to substitute it in
and see which one as the biggest value. In this case the maximum is
.
Determining W
Back to the original problem, from the Lagrangian we obtain,
If
is a unit vector then the second part of the equation is 0.
If
is not a unit vector the the second part of the equation increases. Thus decreasing overall
. Maximization happens when
(Note that to take the derivative with respect to w below,
can be thought of as a quadratic function of w, hence the 2sw below. For more matrix derivatives, see section 2 of the Matrix Cookbook)
Taking the derivative with respect to w, we get:
Set
, we get
From the eigenvalue equation
is an eigenvector of S and
is the corresponding eigenvalue of S. If we substitute
in
we obtain,
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, u1 to have the maximum variance
(i.e. capturing as much variability in in
as possible.) Subsequent principal components will take up successively smaller parts of the total variability.
D dimensional data will have D eigenvectors
where each
represents the amount of variation in direction
so that
Note that the Principal Components decompose the total variance in the data:
i.e. the sum of variations in all directions is the variation in the whole data
Example from class
We apply PCA to the noise data, making the assumption that the intrinsic dimensionality of the data is 10. We now try to compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images
The Matlab code is as follows:
load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')
who
size(X)
imagesc(reshape(X(:,1),20,28)')
colormap gray
imagesc(reshape(X(:,1),20,28)')
[u s v] = svd(X);
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components
figure
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.
colormap gray
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face
The second one is the denoised image
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.
Principle Component Analysis (continued) - July 16
Application of PCA - Feature Abstraction
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data.
Example: Comparing Images of 2s and 3s
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.
The matlab code is as follows.
load 2_3 %the size of this file is 64 X 400
[coefs , scores ] = princomp (X')
% performs principal components analysis on the data matrix X
% returns the principal component coefficients and scores
% scores is the low dimensional representatioation of the data X
plot(scores(:,1),scores(:,2))
% plots the first most variant dimension on the x-axis
% and the second highest on the y-axis
plot(scores(1:200,1),scores(1:200,2))
% same graph as above, only with the 2s (not 3s)
hold on % this command allows us to add to the current plot
plot (scores(201:400,1),scores(201:400,2),'ro')
% this addes the data for the 3s
% the 'ro' command makes them red Os on the plot
% If We classify based on the position in this plot (feature),
% its easier than looking at each of the 64 data pieces
gname() % displays a figure window and
% waits for you to press a mouse button or a keyboard key
figure
subplot(1,2,1)
imagesc(reshape(X(:,45),8,8)')
subplot(1,2,2)
imagesc(reshape(X(:,93),8,8)')
General PCA Algorithm
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).
Other Notes:
- The mean of the data(X) must be 0. This means we may have to preprocess the data by subtracting off the mean(see detailsPCA in Wikipedia.)
- Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding:
using mapping
.
- When we reconstruct the training set, we are only using the top d dimensions.This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing:
using mapping
.
- We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.
Fisher's Linear Discriminant Analysis (FDA) - July 16(cont)
Similar to PCA, the goal of FDA is to project the data in a lower dimension. The difference is that we we are not interested in maximizing variances. Rather our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for direction representative of a particular characteristic e.g. glasses vs. no-glasses).
The number of dimensions that we want to reduce the data to, depends on the number of classes:
For a 2 class problem, we want to reduce the data to one dimension (a line),
Generally, for a k class problem, we want k-1 dimensions,
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within variance of each class. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class is close together (i.e. collapse to a single point).
The following diagram summarizes this goal.
In fact, the two examples above may represent the same data projected on two different lines.
Goal: Maximum Separation
1. Minimize the within class variance
and this problem reduces to
(where
are the covariance matrices for the 1st and 2nd class of data respectively)
Let
be the within classes covariance.
Then, this problem can be rewritten as
2. Maximize the distance between the means of the projected data
The optimization problem we want to solve is,
which is a scalar. Therefore,
(using the property of
Thus our original problem equivalent can be written as,
For a two class problem the between class variance is,
Then this problem can be rewritten as,
Objective Function
We want an objective function which satisifies both of the goals outlined above (at the same time).
-
or
-
or
We take the ratio of the two -- we wish to maximize
or equivalently,
This is a very famous problem which is called "the generalized eigenvector problem". We can solve this using Lagrange Multipliers. Since W is a directional vector, we do not care about the size of W. Therefore we solve a problem similar to that in PCA,
subject to
We solve the following Lagrange Multiplier problem,

Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21
As discussed in the previous lecture, our Optimization problem for FDA is:
which we turned into
Subject to:
where sB is the covariance matrix between classes and sw is the covariance matrix within classes.
Using Lagrange multipliers, we have a Partial solution to:
- The optimal solution for w is the eigenvector of
corresponding to the largest eigenvalue;
- For a k class problem, we will take the eigenvectors corresponding to the (k-1) highest eigenvalues.
- In the case of two-class problem, the optimal solution for w can be simplfied, such that:
Example:
We see now an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two Bivariate normal distributions to show the difference between the two methods.
%First of all, we generate the two data set:
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300)
%In this case: mu_1=[1;1]; Sigma_1=[1 1.5; 1.5 3], where mu and sigma are the mean and covariance matrix.
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300)
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]
%The plot of the two distributions is:
%We compute the principal components:
X=[X1,X2]
X=X'
[coefs, scores]=princomp(X');
coefs(:,1) %first principal component
coefs(:,1)
ans =
0.76355476446932
0.64574307712603
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1
w=sw\[4; 2] % calculate s_w^{-1}(mu2 - mu1)
plot ([0 w(1)], [0 w(2)],'g')
%We now make the projection:
Xf=w'*X
figure
plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"
hold on
plot(Xf(301:600),1,'or')
%We see that in the above picture that there is no overlapping
Xp=coefs(:,1)'*X
figure
plot(Xp(1:300),1,'b')
hold on
plot(Xp(301:600),2,'or')
%In this case there is an overlapping since we project the first principal component on [Xp=coefs(:,1)'*X]
Classification - July 21 (cont)
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance we could be wishing to classify an image as a chair, a desk, or a person. The discrete random variable, Y, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable X is the image. (In the case in which Y is a continuous variable, classification is an application of regression.)
Consider independent and identically distributed data points
where
and
and Y is a finite set of discrete values. The set of data points
is called the training set.
Then
is a classifier where, given a new data point
,
predicts
. The function
is found using the training set. i.e. the set trains
to map
to
:
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.
Error Rate
Given test points, how can we find the emperical error rate?
We simply count the number of points that have been misclassified and divide by the total number of points.
Error rate is also called misclassification rate, and 1 minus the error rate is sometimes called the classification rate.
Bayes Classification Rule
Considering the case of a two-class problem where
Where the denominator can be written as
So our classifier function
This function is considered to be the best classifier in terms of error rate. A problem is that we do not know the joint and marginal probability distributions to calculate
. A real-life interpretation of the marginal
may even deal with patterns and meaning, which provides an extra challenge in finding a mathematical interpretation.
In the image, which set would it be more appropriate for the question mark to belong to?
This function is viewed as a theoretical bound - the best that can be achieved by various classification techniques. One such technique is finding the Decision Boundary.
Decision Boundary
The Decision boundary is given by:
Suggesting those points where the probabilities of being in both classes are identical. Thus,
Linear discriminant analysis has a decision boundary represented by a linear function, while quadratic discriminant analysis has a decision boundary represented by a quadratic function.
Linear Discriminant Analysis(LDA) - July 23
Motivation
We would like to apply Bayes Classifiaction rule by approximating the class conditional density
and the prior
By making the following assumptions we can find a linear approximation to the boundary given by Bayes rule,
- The class conditional density is multivariate gaussian
- The classes have a common covariance matrix
Derivation
- Note on Quadratic Form
-
By assumption (1),

where
is the class covariance matrix and
is the class mean. By definition of the decision boundary (decision boundary between class
and class
),
By assumption (2),
The result is a linear function of
of the form
.
Example:
%Load data set
load 2_3;
[coefs, scores] = princomp(X');
size(X)
% ans = 64 400 % 64 principal components
size(coefs)
% ans = 64 64
size(scores)
% ans = 400 64
Y=scores(:, 1:2);
% just use two of the 64 principal components
plot(Y(1:200, 1),Y(1:200, 2), 'b.')
hold on
plot(Y(201:400, 1),Y(201:400, 2), 'r.')
ll=[zeros(200,1),ones(200,1)];
[C,err,P,logp] = classify(Y, Y, ll', 'linear');
%The output of the function classify are:
% C, that is a vector whose values define the groups that are classified from the rows of the matrix
% err, that is an estimate of the misclassification error
% P, that gives the probabilities that the observation n belongs to the ith class
% logp, that gives the logaritm of the probability density of the nth observation. Where the latter is the sum of
% the conditional probability of the nth observation to belong to the class ith, times the probability of the
% class ith, taken over all classes.
Computational Method
We can implement this computationally by the following:
Define two variables,
and
To classify a point,
, first compute
and
.
Classify it to class
if
and vise versa.
(note: since
is a constant term, we can simply ignore it in the actual computation since it will cancel out when we do the comparison of the deltas.)
Special Case:
, the identity matrix. Then,
We see that in the case
, we can simply classify a point,
, to a class based on the distances between
and the mean of the different classes (adjusted with the log of the prior).
General Case:
(since
is symmetric)

So,
where
and
Hence the approach taken should be to transpose point x from the beginning,
i.e. Let
Then compute
and
with x * , similar to the special case above.
If the prior distributions of the 2 classes are the same, then this method only requires us to find the distances from the point x to the mean of the 2 classes. We would classify x based on the shortest distance to the mean.
In the δ function calculations above, πk = P(X = k) and πl = P(X = l), and can be approximated using the proportions of k and l elements in the training set.
More on Quadratic Discriminant Analysis - July 28
The curse of dimensionality is a problem in many fields, including computer science and statistics. Loosely stated, it says that many problems in computer science and statistics become impractical as the number of dimensions increases.
For example, the number of points needed to accurately estimate parameters increases with the number of parameters. For linear discriminant analysis we have to estimate
elements of the covariance matrix and
elements for the mean, for a total of
. For quadratic discriminant analysis, we have to estimate a second covariance matrix, so we estimate
parameters in total. Clearly we have to estimate more parameters for QDA than LDA, so we need more points to get an accurate estimate for the discriminant.
Estimation:
We can find very flexible boundaries with LDA by expanding the input domain. For example, suppose
has two features and can be represented as
then,
We can expand
to
and get a quadratic boundary.
We can expand
to
and get a cubic boundary.
ect..
In Matlab:
%To double the input:
load 2_3;
[U, Y] = princomp(X');
ll=zeros(N,1);
Y1=[Y Y.^2];
[C,err,p,logp]=classify(Y1,Y1,ll,"linear")
%Now it classify using not only y but also y^2.
























