What?

TL;DR: It’s a giant weighted sum of features where the weights are learned. It comes in single-class (binary) and multi-class variants, and the result is used to assign probabilities to classification classes.

As opposed to Linear Regression - which predicts real values - Logistic Regression is all about predicting binary (categorical) response variables. Under the hood it works in terms of odds and log-odds.

The Core Idea:

👉 Logistic regression doesn’t actually “regress” like linear regression. Instead, it models $P(y \mid x)$; essentially:

Given some input features $x$, what is the probability of class $y$?

  • Essentially, each feature contributes a “vote” for different classes.
  • The model adds up the weighted votes using the dot product (the part inside the main brackets of the formula further below).
  • Then, the Softmax Function (the exponentiate-and-normalise step) ensures that the votes get converted into probabilities.
  • The class behind the highest probability wins.

Mathematically, Single Class:

The core idea is to model the probability that an input belongs to the positive class (class 1) using the sigmoid function:

$$P(y = 1 \mid x) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}$$
Where:

  • $x$ = input feature vector (e.g., size of house, age, etc.)
  • $w$ = weight vector (learned parameters)
  • $b$ = bias (intercept)
  • $\sigma(z) = \frac{1}{1 + e^{-z}}$ = sigmoid function
  • $P(y = 1 \mid x)$ = probability the output is class 1 (positive)

You then classify:

$$\hat{y} = \begin{cases} 1 & \text{if } P(y = 1 \mid x) \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$$

Funnily: this is actually one of the building blocks of a neural network neuron (and the decision boundary it draws simplifies to a straight line).

Think 🤔: What would happen if you upped every single weight by 1? Relative to each other, they’d still all be in the same place, but the actual probability of each would be skewed. Duh…
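
To make the single-class case concrete, here is a minimal NumPy sketch of the prediction step. The weights, bias, and input below are made up purely for illustration, not fitted values.

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights, bias, and input, just to illustrate the shapes.
w = np.array([0.8, -0.4, 1.2])   # learned weight vector
b = -0.5                         # learned bias (intercept)
x = np.array([2.0, 1.0, 0.0])    # input feature vector

score = np.dot(w, x) + b         # weighted "votes" plus bias
p_class1 = sigmoid(score)        # probability of the positive class

prediction = 1 if p_class1 >= 0.5 else 0
print(f"P(y=1 | x) = {p_class1:.3f}, predicted class = {prediction}")
```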

Mathematically, Multi-class:

Choose the class that has the highest probability according to

$$P(y = k \mid x) = \frac{\exp(w_k \cdot x)}{Z(x)}$$

where the normalisation constant is

$$Z(x) = \sum_{j} \exp(w_j \cdot x)$$

  • Inside the brackets is just a dot product: $w_k \cdot x$. Essentially, this is the score the features contribute towards class $k$.
  • $Z(x)$ does not depend on $k$.
  • So, we will end up choosing the class $k$ for which $w_k \cdot x$ is highest.
  • Softmax function: exponentiation of the scores $w_k \cdot x$, followed by normalisation to turn them into a probability distribution.

Note: There’s no bias term here. It’s removed for simplicity in derivations, but it’s often there in practice.
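
A minimal sketch of the multi-class prediction step, assuming one learned weight vector per class and no bias (the weight matrix and input below are made up):

```python
import numpy as np

def softmax(scores):
    """Exponentiate the scores and normalise so they sum to 1."""
    exps = np.exp(scores - scores.max())   # shift scores for numerical stability
    return exps / exps.sum()               # divide by the normalisation constant Z(x)

# Made-up weight matrix: one weight vector (row) per class.
W = np.array([[ 0.5, -1.0,  0.2],
              [-0.3,  0.8,  0.1],
              [ 0.1,  0.1, -0.6]])
x = np.array([1.0, 2.0, 0.5])            # input feature vector

scores = W @ x                           # one dot product w_k . x per class
probs = softmax(scores)                  # turn scores into a probability distribution
predicted_class = int(np.argmax(probs))  # same winner as argmax over the raw scores

print(probs, predicted_class)
```

Because $Z(x)$ is the same for every class, taking the argmax of the raw scores gives the same answer as taking the argmax of the probabilities.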

Predicting By Hand:

Take the following example:

  • $x_1$ = number of occurrences of the phrase “world-beating”
  • $x_2$ = number of occurrences of the phrase “confidence interval”
  • $x_3$ = number of occurrences of the phrase “bootstrap”
  • $y$ = whether the paper was rejected (1) or sent out for review (0).
  • Question: Suppose a paper contains the phrase “world-beating” 5 times, and 0 occurrences of “confidence interval” or “bootstrap”. What is the predicted probability of rejection?
Solution:

To calculate the predicted probability of rejection for a paper with 5 occurrences of “world-beating” and 0 occurrences of “confidence interval” and “bootstrap”, we use the logistic regression coefficients and the following formula:

Log-odds $= b + w_1 x_1 + w_2 x_2 + w_3 x_3$

From there, we apply the sigmoid to get the probability.

Here, $x_1$ is the number of occurrences of “world-beating”, $x_2$ is the number of occurrences of “confidence interval”, and $x_3$ is the number of occurrences of “bootstrap”. Given the coefficients $b$, $w_1$, $w_2$, and $w_3$, and the occurrences $x_1 = 5$, $x_2 = 0$, and $x_3 = 0$, the log-odds are:

Log-odds $= b + w_1 \cdot 5 + w_2 \cdot 0 + w_3 \cdot 0$
Log-odds $= b + 5 w_1$

To convert the log-odds to a probability, we use the logistic function:

$$P(\text{reject}) = \frac{1}{1 + e^{-\text{log-odds}}}$$

Plugging in the log-odds we calculated:

$$P(\text{reject}) = \frac{1}{1 + e^{-(b + 5 w_1)}}$$
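
To sanity-check the by-hand calculation, here is the same computation in code. The coefficient values below are hypothetical placeholders, not the actual fitted coefficients from the example:

```python
import math

# Hypothetical coefficients -- placeholders, not the actual fitted values.
b, w1, w2, w3 = -2.0, 0.9, -0.5, -0.3

# Paper with 5 x "world-beating", 0 x "confidence interval", 0 x "bootstrap".
x1, x2, x3 = 5, 0, 0

log_odds = b + w1 * x1 + w2 * x2 + w3 * x3   # the linear part
p_reject = 1.0 / (1.0 + math.exp(-log_odds)) # sigmoid turns log-odds into probability

print(f"log-odds = {log_odds:.2f}, P(reject) = {p_reject:.3f}")
```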

Training Logistic Regression:

Like any Machine Learning Model:

  1. Iterate over a subset of your data.
  2. Predict the classes / outputs with the model.
  3. Use a loss function (Binary Cross-Entropy in this case) to see how incorrect you were: $L = -\frac{1}{N} \sum_{i} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$
  4. Calculate which direction you should move to get more correct.
  5. Take steps in that direction.
Step 4: Calculate Which Direction You Should Take:

You calculate the direction of the steepest loss decrease. To do this, we’ll use Gradient Descent! We’ll use the following gradients:

  • Gradient of the loss function w.r.t. the weights: $\frac{\partial L}{\partial w} = \frac{1}{N} \sum_{i} (\hat{y}_i - y_i) \, x_i$
  • Gradient of the loss w.r.t. the bias: $\frac{\partial L}{\partial b} = \frac{1}{N} \sum_{i} (\hat{y}_i - y_i)$
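
Putting the five steps together, here is a minimal sketch of the full training loop with plain gradient descent. The toy data, learning rate, and step count are made up so the loop is runnable end to end:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data just to make the loop runnable; X is (N, d), y is (N,).
X = np.array([[0.5, 1.0], [1.5, 2.0], [3.0, 0.5], [2.5, 2.5]])
y = np.array([0, 0, 1, 1])

w = np.zeros(X.shape[1])   # weights start at zero
b = 0.0                    # bias starts at zero
lr = 0.1                   # learning rate (step size)

for step in range(1000):
    y_hat = sigmoid(X @ w + b)                 # step 2: predict probabilities
    # step 3: binary cross-entropy loss (tracked for monitoring)
    loss = -np.mean(y * np.log(y_hat + 1e-12) + (1 - y) * np.log(1 - y_hat + 1e-12))
    # step 4: gradients of the loss w.r.t. the weights and the bias
    grad_w = X.T @ (y_hat - y) / len(y)
    grad_b = np.mean(y_hat - y)
    # step 5: take a step in the direction of steepest loss decrease
    w -= lr * grad_w
    b -= lr * grad_b

print("learned weights:", w, "bias:", b, "final loss:", round(loss, 4))
```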