
Logistic Regression

Explore the fundamentals of logistic regression, a key classification algorithm that predicts class probabilities using the sigmoid function. Understand how to model parameters with maximum likelihood estimation and apply this approach to multi-class problems using one-versus-all strategy. This lesson prepares you to implement logistic regression and optimize coefficients via gradient ascent for better prediction accuracy.

Moving towards class probability

In this lesson, you will learn logistic regression, a classification algorithm that works with probabilities.

Instead of predicting the outcome directly, we can predict the probability of that outcome.

In the linear model, we used the score t to get the class output y. If the score is greater than zero, we say that the class is positive. If the score is less than zero, we say that the class is negative. The score can range from negative infinity to positive infinity.

If the score approaches positive infinity (or is a very large positive value), we can say with confidence that the class is positive. If the score approaches negative infinity (or is a very large negative value), we can say with confidence that the class is negative. If the score is near zero, the answer is uncertain.

So, we have to map the score to the probability range [0, 1].

We can use a link function to do this transformation. This function takes the score and maps it to probability space [0,1].

Logistic function

We can use different types of functions to map the score to a probability scale. A common choice is the sigmoid (logistic) function, defined as:

Sigmoid = \frac{1}{1 + e^{-Score}}

The plot of score versus function value looks like this:

[Plot: the S-shaped sigmoid curve, rising from 0 toward 1 as the score goes from large negative to large positive values, passing through 0.5 at a score of zero]
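As a minimal sketch, the sigmoid can be written in Python like this (the function name is our own choice):

```python
import math

def sigmoid(score):
    """Map a raw score in (-inf, inf) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

print(sigmoid(0))    # 0.5: a score of zero is maximally uncertain
print(sigmoid(10))   # close to 1: confident positive class
print(sigmoid(-10))  # close to 0: confident negative class
```

Note that the sigmoid never actually reaches 0 or 1; it only approaches them as the score grows in magnitude.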

Our objective in logistic regression is to find the coefficients β that predict the right class with high probability.

Sigmoid = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i[1] + \beta_2 x_i[2] + \ldots + \beta_k x_i[k])}}

Logistic regression refers to learning the β weights that maximize a quality metric.
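Putting the two pieces together, predicting a class probability means computing the linear score from the features and passing it through the sigmoid. The coefficient values and the example below are made up purely for illustration:

```python
import math

def predict_probability(x, beta):
    """P(y = +1 | x, beta): sigmoid of the linear score.
    beta[0] is the intercept; beta[1:] pair with the features in x."""
    score = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients and one iris-like example (illustrative only)
beta = [-1.0, 0.3, -0.2, 0.6, 0.9]
x = [6.7, 3.3, 5.7, 2.1]
p = predict_probability(x, beta)  # a probability strictly between 0 and 1
```

If p is above 0.5, we would predict the positive class for this example.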

Quiz: Sigmoid function

1.

What is the output of the Sigmoid function if the score is 3?

A.

0.75

B.

0.85

C.

0.95

D.

1



Handling multiclasses (more than 2 classes)

Multiclass problems have more than two classes, and many real-world problems are multiclass. For example, the original iris dataset has three classes, and we have to predict the right one from the given features.

We can build a multiclass classification model using the one-versus-all strategy. We train C classifiers, where C is the number of classes. Classifier i treats the data belonging to class i as positive and puts all other data in a single negative category. All C classifiers are trained this way. At prediction time, we get a probability from each classifier and output the class with the maximum probability.

For example, let’s say we have three classes: {Cat, Dog, Horse}

  • Classifier 1: Cat and Other dataset
  • Classifier 2: Dog and Other dataset
  • Classifier 3: Horse and Other dataset

When we get a new example, we run it through all the classifier models. The predicted class is the one whose classifier outputs the highest probability.
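The prediction step of one-versus-all can be sketched in a few lines. Here the three "trained classifiers" are stand-in functions returning fixed probabilities, just to show the selection logic; the names are our own:

```python
def one_vs_all_predict(x, classifiers):
    """classifiers maps each class name to a function returning P(class | x).
    Return the class whose classifier gives the highest probability."""
    return max(classifiers, key=lambda cls: classifiers[cls](x))

# Toy stand-ins for three trained binary classifiers (illustrative only)
classifiers = {
    "Cat":   lambda x: 0.20,
    "Dog":   lambda x: 0.70,
    "Horse": lambda x: 0.10,
}

print(one_vs_all_predict(None, classifiers))  # Dog
```

In a real model, each stand-in would be a trained logistic regression classifier for one class versus the rest.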

Logistic regression parameters

Fitting a logistic regression model means finding the parameters (coefficients) used to predict the probability of an example. To do so, we have to define a quality metric. We use maximum likelihood estimation, or MLE.

If we have examples from two classes, then we want the predicted probability of the positive class to be near 1 when the example belongs to the positive class. Similarly, we want the predicted probability of the positive class to be near zero when the example belongs to the negative class.

Now, let’s understand the MLE with an example. We will revisit the iris dataset problem.

sepal_length sepal_width petal_length petal_width Species
6.7 3.3 5.7 2.1 virginica
6.4 3.1 5.5 1.8 virginica
4.9 3.1 1.5 0.1 setosa
5.5 3.5 1.3 0.2 setosa

For simplicity, assume virginica as +1 and setosa as -1. Our model should give a maximum probability of a positive class when the species is virginica. It should also give a maximum probability of negative class when the species is setosa.

sepal_length sepal_width petal_length petal_width Species Maximize probability
6.7 3.3 5.7 2.1 +1 P(y=+1 | x_1, β)
6.4 3.1 5.5 1.8 +1 P(y=+1 | x_2, β)
4.9 3.1 1.5 0.1 -1 P(y=-1 | x_3, β)
5.5 3.5 1.3 0.2 -1 P(y=-1 | x_4, β)

The likelihood is the multiplication of all these probability terms:

L(\beta) = P(y=+1|x_1,\beta) \cdot P(y=+1|x_2,\beta) \cdot P(y=-1|x_3,\beta) \cdot P(y=-1|x_4,\beta)

L(\beta) = \prod_{i=1}^{N} P(y_i|x_i,\beta)

So, our goal is to pick the parameter values that make the above function as large as possible.
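The likelihood for a set of examples is just the product of their per-example probabilities. A small sketch, with made-up probability values standing in for the four iris rows above; we also show the log-likelihood, which is what implementations usually maximize because a long product of small probabilities underflows floating-point numbers:

```python
import math

def likelihood(probs):
    """Product of per-example probabilities P(y_i | x_i, beta)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_likelihood(probs):
    # The sum of logs equals the log of the product but is numerically safer
    return sum(math.log(p) for p in probs)

# Illustrative probabilities for four examples (not computed from real data)
probs = [0.9, 0.8, 0.85, 0.95]
L = likelihood(probs)  # 0.9 * 0.8 * 0.85 * 0.95 = 0.5814
```

Maximizing the log-likelihood gives the same parameters as maximizing the likelihood, since the logarithm is monotonically increasing.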

This can be done with the Gradient Ascent algorithm. We start with random values of the parameters and move closer to the solution by maximizing the likelihood.

Gradient Ascent algorithm

We can think of this as a hill-climbing approach for finding the maximum value.

While not converging

\beta_{T+1} = \beta_T + Stp \cdot \frac{dL(\beta)}{d\beta}

Stp is the step size. It can be fixed, or we can change it dynamically as the algorithm proceeds. In practice, we stop when the derivative is near zero.
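The update rule above can be sketched end to end for logistic regression. This is a minimal illustration, assuming labels encoded as +1/-1 and using the log-likelihood gradient for that encoding (for y ∈ {+1, -1}, the per-example gradient is y · x · (1 − sigmoid(y · score))); a fixed step size and a fixed iteration count stand in for a real convergence check:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_ascent(X, y, step=0.1, iters=1000):
    """Maximize the log-likelihood for labels y in {+1, -1}.
    Each row of X is a feature vector with an intercept column of 1s prepended."""
    beta = [0.0] * len(X[0])  # start from all-zero parameters
    for _ in range(iters):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            score = sum(b * v for b, v in zip(beta, xi))
            # Derivative of log P(y_i | x_i, beta) for the +1/-1 encoding
            coef = yi * (1.0 - sigmoid(yi * score))
            for j, v in enumerate(xi):
                grad[j] += coef * v
        # Move the parameters uphill along the gradient
        beta = [b + step * g for b, g in zip(beta, grad)]
    return beta

# Tiny separable toy data: each row is [intercept, feature]
X = [[1.0, 2.0], [1.0, 3.0], [1.0, -2.0], [1.0, -3.0]]
y = [+1, +1, -1, -1]
beta = gradient_ascent(X, y)
```

After training, the model assigns high probability to the positive examples and low probability to the negative ones; a production implementation would vectorize the loops and stop when the gradient norm falls below a threshold.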

Quiz: Step size in Gradient Ascent algorithm

1.

Mark all correct options Multi-select

A.

If the step size is small, the likelihood-versus-iteration curve is smoother.

B.

If the step size is small, it will take a long time to converge.

C.

If the step size is too large, the likelihood-versus-iteration curve has fewer oscillations.

D.

If the step size is too small, the likelihood-versus-iteration curve has fewer oscillations.



Quiz: Select True or False.

1.

Logistic regression can be used to solve multi-class prediction problems.

A.

True

B.

False



Quiz: Logistic regression

1.

What is the probability range for the class in logistic regression?

A.

[0,1]

B.

(-∞, ∞)

C.

[0,∞)

D.

(-∞,1]


1.

What is the benefit of normalizing features before applying logistic regression?
