Notes on Probability Basics

Comparison between probability theory and mathematical statistics:

Probability theory:

Probability theory uses random variables, their probability distributions, numerical characteristics, and characteristic functions as mathematical tools to describe and analyze random phenomena. Its premise is that the probability distribution of the random variable is assumed to be known.
The method of probability theory is deductive: starting from assumptions, propositions, and known facts about random phenomena, conclusions are obtained through logical reasoning.

Mathematical statistics:

In mathematical statistics, the probability distribution of the random variable is unknown, or the distribution type is known but its parameters are unknown.
Statistical methods randomly sample part of the population under study, conduct experiments or observations, obtain experimental data, and then make inferences about the whole. This is an inductive method.

Probability Theory

1. Random Phenomena

A random experiment is an experiment satisfying the following conditions:
1. The experiment can be repeated under the same conditions.
2. There is more than one possible outcome, and all possible outcomes are known before the experiment.
3. Each trial may produce one of the possible outcomes, but the specific outcome is unknown before the trial.
A random event is a set of possible outcomes of a random experiment. Whether it occurs in a given trial cannot be known in advance, which is what makes it random.
- A random variable can be understood as a numerical extension of random events. By introducing the distribution function of a random variable, we can study it with calculus.
Formal definition of a random variable:
- Let $E$ be a random experiment, and let the sample space, namely all possible sample outcomes, be $S$. For each elementary event $e$ in $S$, if there is a unique real value $X(e)$ corresponding to it, then $X(e)$ is a random variable.
- Random variables include discrete random variables and continuous random variables.
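The formal definition can be made concrete with a small sketch. The experiment (two tosses of a fair coin) and the names `S`, `X`, and `pmf` are illustrative assumptions, not part of the notes above. The point is only that a random variable is a function from elementary events to real numbers.

```python
from fractions import Fraction
from itertools import product

# Sample space S for the assumed experiment "toss a fair coin twice".
S = list(product("HT", repeat=2))

def X(e):
    """Random variable: map an elementary event e to its number of heads."""
    return e.count("H")

# Every elementary event has probability 1/4, so P(X = k) adds up the
# probabilities of all e with X(e) = k.
pmf = {}
for e in S:
    pmf[X(e)] = pmf.get(X(e), Fraction(0)) + Fraction(1, len(S))

for k in sorted(pmf):
    print(f"P(X = {k}) = {pmf[k]}")
```

Because `X` takes only the values 0, 1, and 2, it is a discrete random variable in this sketch.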
In a random experiment, some outcomes are usually more likely to occur. This likelihood exists objectively and is called the probability $P(A)$ of the random event.
The distribution of a random variable $X$ is described by the probabilities of the events that $X$ defines on the sample space $S$, such as $\{X=x\}$ or $\{X\le x\}$. Taken together, these probabilities describe the distribution of the random variable.
- When we already know that a random phenomenon follows a certain distribution, we can directly compute the probability that its random variable takes given values, without performing additional statistical analysis. For example, if we know that the number of customers arriving at a store follows a Poisson distribution, or that class scores follow a normal distribution, then this distribution is prior knowledge and can be used directly. Of course, even if we know the distribution type, there may still be unknown parameters in that distribution, and these parameters need to be estimated by statistical analysis.
- A discrete random variable can be described directly by a probability mass function, which gives the probability $P(X=x)$ of each possible value $x$.
- Concepts related to continuous random variables:
  - For a continuous random variable, the probability of taking one exact value is not meaningful. We usually care about the probability that it falls within an interval. For example, the probability that a light bulb lasts exactly 1.5 years and the probability that it lasts exactly 1.52 years do not carry much direct information. We usually compute the probability that its lifetime lies between 1.5 and 2 years.
  - Solving interval probabilities uses the distribution function:
    $$ P(x_1< X\le x_2)=P(X\le x_2)-P(X\le x_1) $$
    For convenience, write $F(x)=P(X\le x)$ as the distribution function of $X$. A distribution function can be treated as an ordinary function and handled with calculus. For a continuous random variable each endpoint has probability zero, so $P(x_1\le X\le x_2)$ takes the same value.
  - The derivative of the distribution function, $f(x)=F'(x)$, is the probability density of the random variable $X$. The value of the density $f(x)$ at a point $x$ is not the probability of the event $\{X=x\}$. For any continuous random variable, $P(X=x)=0$, so the probability of one single value carries no information.
  - Looking at the graph of a continuous random variable, although the vertical axis is the probability density $f$, the actual probability is represented by the area over an interval, not by the specific density value $f(x)$ at one point.
- Normal distribution:
  $$ f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$
  $$ F(x)=\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^x e^{-\frac{(t-\mu)^2}{2\sigma^2}}\,dt $$
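A minimal sketch of the interval computation for the light bulb example. It assumes, purely for illustration, that the lifetime is normally distributed with hypothetical parameters `mu = 1.8` and `sigma = 0.2` years. The function names are also illustrative. The distribution function is written with the standard library's `math.erf`.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density f(x) of the normal distribution."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    """Distribution function F(x) = P(X <= x), expressed through math.erf."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 1.8, 0.2  # hypothetical bulb lifetime parameters, in years

# Interval probability as a difference of distribution-function values.
p_interval = normal_cdf(2.0, mu, sigma) - normal_cdf(1.5, mu, sigma)

# The same probability as the area under the density (midpoint rule).
n = 10_000
h = (2.0 - 1.5) / n
area = sum(normal_pdf(1.5 + (i + 0.5) * h, mu, sigma) for i in range(n)) * h

print(p_interval, area)
print(normal_pdf(mu, mu, sigma))  # a density value, not a probability
```

With these assumed parameters the density at `mu` is larger than 1, which shows that a density value cannot be read as a probability. Only the area over an interval can.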

2. Conditional Probability and Independence

Describing relationships among random events.

Conditional probability.

Conditional probability $P(A|B)$ is the probability that random event $A$ occurs given that random event $B$ has occurred. It is defined when $P(B)>0$. Strictly speaking, the occurrence of most random events is conditional.

$$ P(A|B)=\frac{P(AB)}{P(B)} $$

$P(AB)$ denotes the probability that random events $A$ and $B$ occur simultaneously. Thus, we can derive the multiplication rule:

$$ P(AB)=P(A|B)P(B) $$
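The definition and the multiplication rule can be checked by enumeration. The sketch below assumes two fair dice and two illustrative events that do not come from the notes: $A$ is "the sum is 8" and $B$ is "the first die is even".

```python
from fractions import Fraction
from itertools import product

# Two fair dice: 36 equally likely outcomes (assumed example).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def A(o):
    return o[0] + o[1] == 8   # event A: the sum is 8

def B(o):
    return o[0] % 2 == 0      # event B: the first die is even

p_ab = prob(lambda o: A(o) and B(o))   # P(AB)
p_a_given_b = p_ab / prob(B)           # P(A|B) = P(AB) / P(B)

print(p_ab, prob(B), p_a_given_b)
```

Multiplying `p_a_given_b` by `prob(B)` recovers `p_ab`, which is the multiplication rule.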

Law of total probability.

For random events $A$ and $B$, $A\subset B$ means that the occurrence of event $A$ necessarily leads to the occurrence of event $B$, that is, $A$ is a subset of $B$.

Suppose $B\subset A_1+A_2+\cdots+A_n$, where $A_1,\ldots,A_n$ are mutually exclusive and cannot occur simultaneously. This means that if event $B$ occurs, then one of $A_1,\ldots,A_n$ must occur. Then

$$ P(B)=\sum_{i=1}^n P(A_i)P(B|A_i) $$

This is the law of total probability. It is mainly used to compute the probability of a more complex unknown event through simpler event probabilities.

Note that the law of total probability is usually applied when the events $B$ and $A_i$ describe two different aspects of the same experiment and satisfy the following conditions:

- $\cup A_i$ covers all cases in which the relevant events can occur, although it does not have to be the entire sample space.
- The events $A_i$ must be mutually exclusive.
- $B$ and the $A_i$ belong to two different categories of events. For example, $B$ may denote “the product is defective”, and $A_i$ may denote “the product comes from the $i$-th factory”.
- Event $B$ must be contained in $\cup A_i$.

Usually, because the $A_i$ are mutually exclusive, $\cup A_i$ is directly treated as the full set of possible cases for category $A$. Otherwise, the formula is not very useful.
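A small sketch of the factory example. The production shares and defect rates are hypothetical numbers chosen only to illustrate the formula.

```python
from fractions import Fraction

# Hypothetical inputs: P(A_i) is the share of products from factory i,
# and P(B | A_i) is that factory's defect rate.
p_factory = {
    "factory_1": Fraction(1, 2),
    "factory_2": Fraction(3, 10),
    "factory_3": Fraction(1, 5),
}
p_defect_given_factory = {
    "factory_1": Fraction(1, 100),
    "factory_2": Fraction(2, 100),
    "factory_3": Fraction(5, 100),
}

# The A_i are mutually exclusive and cover every product.
assert sum(p_factory.values()) == 1

# Law of total probability: P(B) = sum_i P(A_i) * P(B | A_i).
p_defect = sum(p_factory[a] * p_defect_given_factory[a] for a in p_factory)
print(p_defect)
```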

Bayes’ Rule

First, the formula. Under the same conditions as the law of total probability, where $A_1,\ldots,A_n$ are mutually exclusive and event $B$ is contained in $\cup A_i$:

$$ P(A_i|B)=\frac{P(A_i)P(B|A_i)}{\sum_{j=1}^nP(A_j)P(B|A_j)} $$

Bayes’ rule is a very important formula. Since $B\subset A_1+\cdots+A_n$, we can regard $A_i$ as one possible cause of event $B$. Then $P(A_i|B)$ asks for the probability that event $A_i$ caused event $B$ to occur.

In the formula:

$P(A_i)$ is called the prior probability, which is usually known before the random experiment.
$P(A_i|B)$ is the posterior probability, namely the probability we want to compute. It reflects how the likelihood of each possible cause changes after observing the experiment. For example, after knowing that someone was admitted to a university, we may judge how likely it is that the person is intelligent.
$P(B|A_i)$ is the conditional probability of observing $B$ when the cause $A_i$ holds. It serves as the likelihood term in the formula.
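The three terms can be put together in a short sketch. The function `bayes_posterior` and the numbers (three factories with hypothetical production shares and defect rates) are illustrative assumptions.

```python
from fractions import Fraction

def bayes_posterior(prior, likelihood):
    """Return P(A_i | B) for every cause A_i, given P(A_i) and P(B | A_i)."""
    # Denominator of Bayes' rule: P(B), by the law of total probability.
    evidence = sum(prior[a] * likelihood[a] for a in prior)
    return {a: prior[a] * likelihood[a] / evidence for a in prior}

# Hypothetical prior P(A_i) and likelihood P(B | A_i), with B = "defective".
prior = {"factory_1": Fraction(1, 2), "factory_2": Fraction(3, 10), "factory_3": Fraction(1, 5)}
likelihood = {"factory_1": Fraction(1, 100), "factory_2": Fraction(2, 100), "factory_3": Fraction(5, 100)}

posterior = bayes_posterior(prior, likelihood)
for name, p in posterior.items():
    print(name, p)
```

In this sketch the factory with the smallest prior ends up with the largest posterior, because its assumed defect rate is much higher than the others.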

Bayes’ rule in machine learning classification problems, as discussed in Hung-yi Lee’s course:

For many machine learning problems, the training process can be considered from two perspectives:

Treat the model as a prediction function. Training minimizes the loss function to obtain the corresponding parameters $w$. Prediction takes the feature value $x$ of a new sample as input and outputs the label value $y$.
Treat the model as a probability density function that represents the data distribution. Training is then a process of parameter estimation for the probability distribution. Prediction becomes computing the conditional probability $P(y=?|x)$, which represents the probability of class $y=?$ given input feature $x$.
- Let the features be $X=(x_1,\ldots,x_n)$, and let the possible categories be $Y\in\{y_1,y_2,y_3\}$. The Bayes model computes $P(Y=y_j|X)$.
- According to Bayes’ rule:
  $$ P(Y=y_j|X)=\frac{P(Y=y_j,X)}{P(X)} =\frac{P(X|Y=y_j)P(Y=y_j)}{P(X)} $$
  Therefore, training a Bayes model means estimating the prior probability $P(Y=y_j)$ and the likelihood $P(X|Y=y_j)$ from the training data.
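A minimal sketch of this idea for a single categorical feature. The toy data set and the helper names `prior`, `likelihood`, and `posterior` are hypothetical. Both probabilities are estimated by simple counting, with no smoothing, so this is an illustration rather than a practical classifier.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical training set of (feature x, label y) pairs.
data = [
    ("sunny", "play"), ("sunny", "play"), ("rainy", "play"),
    ("rainy", "stay"), ("rainy", "stay"), ("sunny", "stay"),
]

label_counts = Counter(y for _, y in data)
pair_counts = Counter(data)

def prior(y):
    """Estimate P(Y = y) as a relative frequency."""
    return Fraction(label_counts[y], len(data))

def likelihood(x, y):
    """Estimate P(X = x | Y = y) as a relative frequency within class y."""
    return Fraction(pair_counts[(x, y)], label_counts[y])

def posterior(y, x):
    """P(Y = y | X = x) by Bayes' rule, with P(X = x) from total probability."""
    evidence = sum(likelihood(x, c) * prior(c) for c in label_counts)
    return likelihood(x, y) * prior(y) / evidence

print(posterior("play", "sunny"), posterior("stay", "sunny"))
```

Prediction then picks the class with the largest posterior for the observed feature value.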

3. Multidimensional Random Variables

Concept of a two-dimensional random variable.
Let $(X,Y)$ be a two-dimensional random variable. Denote the intersection of the event $\{X\le x\}$ and the event $\{Y\le y\}$ as $\{X\le x,Y\le y\}$. The bivariate function
$$ F(x,y)=P(X\le x,Y\le y) $$
is called the distribution function of $(X,Y)$, or the joint distribution function of $X$ and $Y$.
Here, it is worth emphasizing the notation. In $P(XY)$, $X$ and $Y$ represent random events, and $XY$ represents their intersection, so $P(XY)$ is the probability that both events occur. In $P(X=x,Y=y)$, $X$ and $Y$ represent random variables, more specifically discrete random variables. The comma again denotes an intersection: it is the probability that the events $\{X=x\}$ and $\{Y=y\}$ occur simultaneously.
For a two-dimensional discrete random variable, $p_{ij}=P(X=x_i,Y=y_j)$ is called the joint probability mass function of $(X,Y)$.
For a two-dimensional continuous random variable,
$$ F(x,y)=\int_{-\infty}^x\int_{-\infty}^y f(u,v)\,dv\,du $$
is the joint distribution function, and $f(x,y)$ is the joint probability density.
To obtain the distribution of one component of the two-dimensional random variable $(X,Y)$, say $X$, we place no restriction on the other component. This amounts to summing or integrating over all possible values of $Y$, and gives the marginal distribution function of $X$:
$$ F_X(x)=P(X\le x)=P(X\le x,Y\le+\infty)=F(x,+\infty) $$
Independence of multidimensional random variables.
Let $F(x,y)$, $F_X(x)$, and $F_Y(y)$ be the distribution functions of $(X,Y)$, $X$, and $Y$, respectively. If for any $x,y$,
$$ F(x,y)=F_X(x)F_Y(y) $$
holds, then random variables $X$ and $Y$ are independent.
In particular, if $X$ and $Y$ are continuous random variables and their density satisfies
$$ f(x,y)=f_X(x)f_Y(y) $$
then $X$ and $Y$ are also independent.
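For discrete random variables the analogous check is $p_{ij}=p_{i\cdot}\,p_{\cdot j}$ for every pair of values. The sketch below computes marginals from a joint probability mass function and tests this condition. The two joint tables and the function names are illustrative assumptions.

```python
from fractions import Fraction

def marginals(joint):
    """Marginal pmfs of X and Y from a joint pmf {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p   # sum over all values of Y
        py[y] = py.get(y, 0) + p   # sum over all values of X
    return px, py

def is_independent(joint):
    """Check p(x, y) == p_X(x) * p_Y(y) for every pair of values."""
    px, py = marginals(joint)
    return all(joint.get((x, y), 0) == px[x] * py[y] for x in px for y in py)

# Hypothetical joint pmf in which X and Y tend to take the same value.
dependent = {
    (0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 5),
}
px, py = marginals(dependent)

# A joint pmf with the same marginals, built as their product.
independent = {(x, y): px[x] * py[y] for x in px for y in py}

print(is_independent(dependent), is_independent(independent))
```

Both tables have the same marginals, so the marginals alone do not determine whether two random variables are independent.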
Conditional distribution.
Note that this is the conditional distribution of random variables. It is not exactly the same concept as the conditional probability of random events above, although they are used in very similar ways.
For discrete random variables:
$$ P(X=x_i|Y=y_j) =\frac{P(X=x_i,Y=y_j)}{P(Y=y_j)} =\frac{p_{ij}}{p_{\cdot j}} $$
For continuous random variables:
$$ F_{X|Y}(x|y) =\int_{-\infty}^x f_{X|Y}(u|y)\,du =\int_{-\infty}^x\frac{f(u,y)}{f_Y(y)}\,du $$
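The discrete formula can be sketched as a small function that divides one column of the joint table by its marginal. The joint table and the name `conditional_pmf_x_given_y` are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint pmf p_ij = P(X = x_i, Y = y_j).
joint = {
    (0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 5),
}

def conditional_pmf_x_given_y(joint, y):
    """P(X = x | Y = y) = p(x, y) / p_Y(y), defined when p_Y(y) > 0."""
    p_y = sum(p for (_, v), p in joint.items() if v == y)
    if p_y == 0:
        raise ValueError("conditioning event has probability zero")
    return {x: p / p_y for (x, v), p in joint.items() if v == y}

cond = conditional_pmf_x_given_y(joint, 0)
print(cond)
```

The result is itself a probability mass function in $x$, so its values sum to 1.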

4. Numerical Characteristics of Random Variables

The probability distribution of a random variable describes the statistical law of the random phenomenon and can fully describe the variable’s properties. However, in practice we often need concise numerical characteristics.

Mathematical expectation.
Expectation describes the “average value” produced by a random variable.
Here, we know the probability distribution and want to compute the expectation. But expectation should not be simply equated with the empirical “average”, because the average is a statistic computed from a finite sample, while the expectation is a property of the distribution. The two are linked by the law of large numbers: as the number of independent samples grows, the sample average converges to the mathematical expectation.
Suppose $X$ can take values $\{x_1,x_2,x_3,x_4\}$. After $N$ experiments, the counts of these values are $\{n_1,n_2,n_3,n_4\}$. The average of the observed values can be written as
$$ \frac{\sum_{i=1}^4 n_i\cdot x_i}{N} $$
Here, the relative frequency $n_i/N$ can be viewed as an approximation of the probability that $x_i$ occurs when $N$ is large:
$$ P(X=x_i)=p_i\approx\frac{n_i}{N} $$
Therefore, the average can be expressed approximately as
$$ \sum_{i=1}^4 x_i\cdot p_i $$
which is exactly the expectation.
The summation here involves the concept of a series. For the expectation to exist, the series must converge absolutely, or the corresponding integral must converge absolutely.
Mathematical expectation of a discrete random variable:
$$ E(X)=\sum_{i=1}^\infty x_i\cdot p_i $$
Mathematical expectation of a continuous random variable:
$$ E(X)=\int_{-\infty}^{+\infty}x\cdot f(x)\,dx $$
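The relation between the expectation and an observed average can be illustrated with a fair die, which is an assumed example. The expectation is computed exactly from the probability mass function, and the sample average comes from a seeded simulation.

```python
import random
from fractions import Fraction

# Exact expectation of a fair die: E(X) = sum of x_i * p_i with p_i = 1/6.
expectation = sum(x * Fraction(1, 6) for x in range(1, 7))

# Average of N simulated rolls. The seed only makes the run repeatable.
rng = random.Random(0)
N = 100_000
sample_mean = sum(rng.randint(1, 6) for _ in range(N)) / N

print(expectation, sample_mean)
```

The sample average is close to the expectation but generally not equal to it. It depends on the particular sample, whereas the expectation depends only on the distribution.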
Mathematical expectation of a function of random variables:
Let $Z=g(X,Y)$, where $g(X,Y)$ is a continuous function.
If $(X,Y)$ is a two-dimensional discrete random variable:
$$ E(Z)=E[g(X,Y)] =\sum_{i=1}^{\infty}\sum_{j=1}^{\infty}g(x_i,y_j)p_{ij} $$
If $(X,Y)$ is a two-dimensional continuous random variable:
$$ E(Z)=E[g(X,Y)] =\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}g(x,y)f(x,y)\,dx\,dy $$
Properties of mathematical expectation:
- $E(CX)=CE(X)$
- $E(X_1+X_2)=E(X_1)+E(X_2)$
- If $X_1$ and $X_2$ are independent, then $E(X_1X_2)=E(X_1)E(X_2)$.
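These properties can be checked on a small discrete case. The sketch assumes two independent fair dice, so the joint probability mass function is $p_{ij}=1/36$. The helper `expect` is an illustrative name for the double sum $\sum_i\sum_j g(x_i,y_j)p_{ij}$.

```python
from fractions import Fraction

# Joint pmf of two independent fair dice (assumed example): p_ij = 1/36.
joint = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}

def expect(g):
    """E[g(X, Y)] as the double sum of g(x_i, y_j) * p_ij."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)

scaled = expect(lambda x, y: 3 * x)        # E(CX) with C = 3
total = expect(lambda x, y: x + y)         # E(X1 + X2)
prod = expect(lambda x, y: x * y)          # E(X1 X2), using independence

print(scaled == 3 * e_x, total == e_x + e_y, prod == e_x * e_y)
```

The first two equalities hold for any joint distribution. The third relies on the independence built into this joint table.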
Variance.
Variance represents the degree to which random variable $X$ deviates from $E(X)$. From a statistical perspective, variance represents the average fluctuation of data around the mean and can be used to evaluate data stability.
To measure average deviation, it is natural to use the mean of deviations. The deviation can be expressed as $|X-E(X)|$, so one might consider $E(|X-E(X)|)$. However, the absolute value makes this expression inconvenient to handle analytically. Therefore, squared deviation is used:
$$ E([X-E(X)]^2) $$
Definition of variance:
$$ D(X)=E([X-E(X)]^2) $$
Variance of a discrete random variable:
$$ D(X)=\sum_{i=1}^\infty [x_i-E(X)]^2p_i $$
Variance of a continuous random variable:
$$ D(X)=\int_{-\infty}^{+\infty}(x-E(X))^2f(x)\,dx $$
Common formula for variance:
$$ D(X)=E(X^2)-[E(X)]^2 $$
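As a short worked derivation, this formula follows from expanding the square in the definition and applying the linearity of expectation, treating $E(X)$ as a constant:
$$ D(X)=E([X-E(X)]^2)=E(X^2-2X\,E(X)+[E(X)]^2)=E(X^2)-2E(X)E(X)+[E(X)]^2=E(X^2)-[E(X)]^2 $$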
Properties of variance:
- $D(CX)=C^2D(X)$
- If $X_1$ and $X_2$ are independent, then $D(X_1+X_2)=D(X_1)+D(X_2)$.
- If $X$ and $Y$ are independent, then
  $$ D(XY)=D(X)D(Y)+D(X)[E(Y)]^2+D(Y)[E(X)]^2 $$
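The variance formulas above can be checked exactly on fair dice, which is an assumed example. The helpers `E` and `D` are illustrative names that work on a probability mass function stored as a dictionary.

```python
from fractions import Fraction

die = {x: Fraction(1, 6) for x in range(1, 7)}  # pmf of one fair die

def E(g, pmf):
    """Expectation of g(X) under a discrete pmf {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

def D(pmf):
    """Variance from the definition E([X - E(X)]^2)."""
    m = E(lambda x: x, pmf)
    return E(lambda x: (x - m) ** 2, pmf)

mean = E(lambda x: x, die)
var = D(die)
shortcut = E(lambda x: x * x, die) - mean ** 2   # E(X^2) - [E(X)]^2

scaled = {2 * x: p for x, p in die.items()}      # pmf of 2X
var_scaled = D(scaled)

# pmf of the product XY of two independent dice.
product_pmf = {}
for x, p_x in die.items():
    for y, p_y in die.items():
        product_pmf[x * y] = product_pmf.get(x * y, 0) + p_x * p_y
var_product = D(product_pmf)
formula = var * var + var * mean ** 2 + var * mean ** 2

print(var, shortcut, var_scaled, var_product, formula)
```

Exact fractions are used so that the equalities can be compared without rounding error.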
Covariance.
For a two-dimensional random variable $(X,Y)$, expectation and variance describe the properties of a single random variable. We need another quantity to describe the relationship between two random variables.
Definition of covariance:
$$ \operatorname{Cov}(X,Y)=E\{[X-E(X)][Y-E(Y)]\} $$
Common formula for covariance:
$$ \operatorname{Cov}(X,Y)=E(XY)-E(X)E(Y) $$
Properties of covariance:
- $\operatorname{Cov}(aX,bY)=ab\operatorname{Cov}(X,Y)$
- $\operatorname{Cov}(X_1+X_2,Y)=\operatorname{Cov}(X_1,Y)+\operatorname{Cov}(X_2,Y)$
Using covariance, we can supplement the variance relationship for any two-dimensional random variables $X,Y$. Regardless of whether $X$ and $Y$ are independent, we have
$$ D(X+Y)=D(X)+D(Y)+2\operatorname{Cov}(X,Y) $$
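A sketch that checks both covariance formulas and the variance of a sum on a joint table where $X$ and $Y$ are not independent. The table and the helper `expect` are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint pmf in which X and Y tend to take the same value.
joint = {
    (0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 5),
}

def expect(g):
    """E[g(X, Y)] under the joint pmf."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)

cov = expect(lambda x, y: (x - e_x) * (y - e_y))       # definition
cov_shortcut = expect(lambda x, y: x * y) - e_x * e_y  # E(XY) - E(X)E(Y)

d_x = expect(lambda x, y: (x - e_x) ** 2)
d_y = expect(lambda x, y: (y - e_y) ** 2)
d_sum = expect(lambda x, y: (x + y - e_x - e_y) ** 2)  # D(X + Y)

print(cov, d_sum, d_x + d_y + 2 * cov)
```

Because the covariance is positive here, $D(X+Y)$ is larger than $D(X)+D(Y)$.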
Correlation coefficient.
Using covariance to describe the relationship between random variables has two drawbacks:
- From the first property, the degree of correlation depends on the measurement units, or scale coefficients.
- From the definition, its magnitude depends on how far each variable spreads around its own expectation, that is, on the sizes of $X-E(X)$ and $Y-E(Y)$, and not only on how strongly $X$ and $Y$ are related.
Therefore, we normalize the covariance of two random variables to describe their correlation:
$$ \rho = \frac{\operatorname{Cov}(X,Y)}{\sqrt{D(X)}\sqrt{D(Y)}} = \frac{E\{[X-E(X)][Y-E(Y)]\}}{\sqrt{D(X)}\sqrt{D(Y)}} $$
Properties of the correlation coefficient:
- $|\rho|\le 1$
- $|\rho|=1$ if and only if $P(Y=aX+b)=1$ for some constants $a,b$ with $a\ne 0$.
- If the correlation coefficient of $X$ and $Y$ is $\rho=0$, then $X$ and $Y$ are called uncorrelated. This is equivalent to:
  - $\operatorname{Cov}(X,Y)=0$
  - $D(X+Y)=D(X)+D(Y)$
  - $E(XY)=E(X)E(Y)$
For general random variables, being uncorrelated does not necessarily imply independence, although independence implies zero covariance when the relevant expectations exist.
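The last point can be illustrated with a standard small example, chosen here as an assumption: let $X$ take the values $-1, 0, 1$ with equal probability. With $Y=2X+1$ the correlation coefficient is 1, while with $Y=X^2$ it is 0 even though $Y$ is completely determined by $X$. The helper names are illustrative.

```python
import math
from fractions import Fraction

def summary(joint):
    """Return (Cov(X, Y), D(X), D(Y)) for a joint pmf {(x, y): p}."""
    e_x = sum(x * p for (x, _), p in joint.items())
    e_y = sum(y * p for (_, y), p in joint.items())
    cov = sum((x - e_x) * (y - e_y) * p for (x, y), p in joint.items())
    d_x = sum((x - e_x) ** 2 * p for (x, _), p in joint.items())
    d_y = sum((y - e_y) ** 2 * p for (_, y), p in joint.items())
    return cov, d_x, d_y

def rho(joint):
    """Correlation coefficient Cov(X, Y) / (sqrt(D(X)) * sqrt(D(Y)))."""
    cov, d_x, d_y = summary(joint)
    return float(cov) / math.sqrt(float(d_x * d_y))

third = Fraction(1, 3)
linear = {(x, 2 * x + 1): third for x in (-1, 0, 1)}   # Y = 2X + 1
quadratic = {(x, x * x): third for x in (-1, 0, 1)}    # Y = X^2

print(rho(linear), rho(quadratic))

# Dependence check for Y = X^2: compare P(X=0, Y=0) with P(X=0) * P(Y=0).
p_x0 = sum(p for (x, _), p in quadratic.items() if x == 0)
p_y0 = sum(p for (_, y), p in quadratic.items() if y == 0)
print(quadratic[(0, 0)], p_x0 * p_y0)
```

In the quadratic case the joint probability differs from the product of the marginals, so $X$ and $Y$ are uncorrelated but not independent.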