October 14, 2018

Exploring the Cost Function of Logistic Regression

This post explores the cost function of logistic regression and how it connects to the logistic function.

knitr::opts_chunk$set(echo = TRUE, warning=F, message=FALSE, fig.align='center')
library(tidyverse)
library(plotly)
library(latex2exp)

I want to get a deeper understanding of the connection between logistic regression and its cost function. Therefore, I implemented the cost function in R and evaluated it over a grid of parameter values (a grid approximation). The results are presented below.

Logistic regression can be applied to data where the dependent variable is binary: the reference class is coded as 1 and all other observations as 0. We can model the relationship between this binary outcome and metric as well as categorical predictor variables. As a result, we can compute the predicted probability that an observation belongs to the reference class coded as 1.

The logistic regression model with an intercept and one predictor variable is as follows:

\[\log\left(\frac{p(x_i)}{1 - p(x_i)}\right) = \beta_0 + \beta_1 x_i\]

This formula models the logarithm of the odds, also called the logit. To find the \(\beta\)-coefficients that fit the data best, we minimize the following cost function, the average negative log-likelihood:

\[Cost(p(x), y) = \frac{1}{n}\sum_{i=1}^{n}\left[-y_i \log(p(x_i)) - (1 - y_i)\log(1 - p(x_i))\right]\]
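
As a brief aside on where this cost function comes from: under the model, each observation follows a Bernoulli distribution with success probability \(p(x_i)\), so the likelihood of the whole sample is

\[L(\beta_0, \beta_1) = \prod_{i=1}^{n} p(x_i)^{y_i}\,\big(1 - p(x_i)\big)^{1 - y_i}\]

Taking the logarithm, flipping the sign, and dividing by \(n\) gives exactly the cost above. Minimizing this cost is therefore equivalent to maximizing the likelihood.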

Since the linear part of the logistic regression predicts the logit, the equation can be rearranged to obtain \(p(x_i)\), the probability that an observation belongs to the reference class:

\[p(x_i) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}\]

# Logistic function to predict p(x)
# D: design matrix (intercept column plus predictors), b: column vector of weights
logistic_function <- function(D, b) {
  1 / (1 + exp(-(D %*% b)))
}
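
As a quick sanity check (with a made-up design matrix, just for illustration), a weight vector of zeros should return a predicted probability of 0.5 for every observation:

# Hypothetical check: with both weights set to 0, every prediction is 0.5
D_test <- cbind(1, c(10, 20, 30))   # intercept column plus one predictor
logistic_function(D_test, b = matrix(c(0, 0), nrow = 2))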

We can then create the cost function and use the logistic function inside it.

# Cost function for logistic regression (average negative log-likelihood)
logistic_cost <- function(y, D, b){
  y_hat <- logistic_function(D, b)
  1 / length(y_hat) * sum(-y * log(y_hat) - (1 - y) * log(1 - y_hat))
}
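
To see that the function behaves sensibly, here is a small made-up example with two observations (values chosen only for illustration): weights that separate the two classes give a much lower cost than the zero vector, which predicts 0.5 for everything.

# Hypothetical toy data: one observation per class
D_toy <- cbind(1, c(-2, 2))
y_toy <- c(0, 1)
logistic_cost(y_toy, D_toy, b = matrix(c(0, 2), nrow = 2))  # close to 0
logistic_cost(y_toy, D_toy, b = matrix(c(0, 0), nrow = 2))  # log(2), about 0.69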

To conduct the grid approximation, I write another function that takes the grid of parameter values, the predictor matrix, and the outcome variable y as arguments and then computes the cost for every parameter combination. The function returns a matrix containing the grid and the corresponding cost.

# Compute the cost for different combinations of regression weights
# Returns a matrix containing the grid and the corresponding cost
grid_approx_logistic <- function(grid, data, outcome) {
  cost <- numeric(nrow(grid))
  for(i in seq_len(nrow(grid))) {
    betas <- as.numeric(grid[i, ])
    betas <- create_weight_matrix(betas)
    cost[i] <- logistic_cost(y = outcome, D = data, b = betas)
  }
  cost_grid <- as.matrix(cbind(grid, cost))
  return(cost_grid)
}

# Create helper function to construct the weight matrix given a vector with proposed weights
# Used in the grid_approx_logistic function
create_weight_matrix <- function(weights) {
  betas <- matrix(weights, nrow = length(weights))
  return(betas)
}

I will use the mtcars dataset and model the probability that a car has a manual transmission (am = 1) as a function of miles per gallon (mpg). First, I set up a grid of plausible values (which I got by running the glm function beforehand).

After that, I create the predictor matrix, containing the intercept and the mpg values, and extract the variable am as the outcome \(y\).

# create grid
grid <- expand.grid(seq(-20, 0, length.out = 50), seq(0, 1, length.out = 50))

# set intercept
mtcars$intercept <- 1
# subset prediction matrix
data <- as.matrix(mtcars[, c("intercept", "mpg")])
# subset dependent variable
y <- mtcars$am

Next, I run the grid_approx_logistic function, which computes the cost for every parameter combination.

# compute cost for each parameter combination
cost <- grid_approx_logistic(grid = grid, data = data, outcome = y)
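
Before plotting, a quick way to check the result is to look at the grid point with the lowest cost; it should lie close to the coefficients that glm computes below.

# Grid point with the lowest cost (should be near the glm estimates)
as.data.frame(cost) %>% arrange(cost) %>% head(1)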

I visualize the result with a combination of a raster and a contour plot to highlight the parameter combination with the lowest cost, i.e. the highest likelihood.

Furthermore, I add the regression weights computed by the glm function to the plot.

# compute the logistic regression with base r function glm
model <- glm(am ~ mpg, family = "binomial", data = mtcars)  
coefs <- as.data.frame(as.list(coef(model))) %>% set_names(c("beta0", "beta1"))

# visualize the regession combinations vs the computed cost
as.data.frame(cost) %>% 
  ggplot(aes(x = Var1, y = Var2, z = cost, color = cost, fill = cost)) +
  geom_raster(interpolate = FALSE) +
  geom_contour(bins = 30) + 
  geom_point(data = coefs, aes(x = beta0, y = beta1), size = 4, color = "white", inherit.aes = F) +
  geom_label(data = coefs, aes(x = beta0, y = beta1), 
             label = TeX("$\\beta_0, \\beta_1 = (-6.6, 0.3)$"),
             inherit.aes = F,
             nudge_x = 1, nudge_y = .05, size = 5) +
  theme_classic() +
  scale_fill_continuous("Cost-Function") +
  scale_color_continuous("Cost-Function") +
  xlab(TeX("$\\beta_0$ (Intercept)")) +
  ylab(TeX("$\\beta_1$ (Regression Weight for Miles per Gallon)")) +
  ggtitle("Minimizing the Negative Log-Likelihood", subtitle = "Grid Approximation") +
  theme(text = element_text(family = "Times"),
        title = element_text(size = 14), 
        axis.text = element_text(size = 12))
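
Since plotly is loaded above but not used so far, the same cost surface can also be inspected as an interactive 3D plot. Here is a minimal sketch, assuming the grid ordering produced by expand.grid (Var1 varies fastest, so the rows of the matrix correspond to \(\beta_0\) and the columns to \(\beta_1\)):

# Interactive 3D view of the cost surface
# Rows of the z matrix correspond to beta0, columns to beta1
cost_matrix <- matrix(cost[, "cost"], nrow = 50)
plot_ly(x = seq(0, 1, length.out = 50),     # beta1 values (columns of z)
        y = seq(-20, 0, length.out = 50),   # beta0 values (rows of z)
        z = cost_matrix) %>%
  add_surface()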

We can see that the area with the lowest cost, i.e. the highest likelihood, corresponds to the coefficients computed by glm.

However, we can also see that there is a whole area of low cost values, which may lead to unstable estimates of the regression coefficients.

Note that this only worked because I centered the grid on the result of the glm function. In general, grid approximation is a poor approach to finding the best parameter combination of a logistic model, since we usually do not know in advance where to place the grid.

Other approaches use the gradient descent algorithm, which is computationally much cheaper and therefore faster.
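
To illustrate the idea, here is a minimal sketch of batch gradient descent for this model. The function name gradient_descent_logistic is a hypothetical helper, and the learning rate and number of iterations are arbitrary, untuned choices; note that glm itself fits the model with iteratively reweighted least squares rather than plain gradient descent.

# Minimal sketch of batch gradient descent for the same model
# Learning rate and number of iterations are arbitrary choices, not tuned
gradient_descent_logistic <- function(y, D, lr = 0.01, iterations = 100000) {
  b <- matrix(0, nrow = ncol(D))              # start at the zero vector
  for (i in seq_len(iterations)) {
    p <- logistic_function(D, b)              # current predicted probabilities
    gradient <- t(D) %*% (p - y) / length(y)  # gradient of the average negative log-likelihood
    b <- b - lr * gradient                    # step downhill
  }
  b
}

# With enough iterations the estimates move toward the glm coefficients
gradient_descent_logistic(y, data)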
