Multiple logistic regression power analysis

I have a logistic regression model that outputs an $R^2$ value. I then add another predictor variable and fit a second model, which gives a new $R^2$ value. When I run an ANOVA test I see no significant improvement in the second model, but I want to assess the power associated with including the additional variable in model 2. I have found an example for linear regression that uses an $F$-test, and I want to do something similar for a logistic regression using G*Power. But there appears to be very little documentation on multiple logistic regression models like my situation, and I don't know how to do a more detailed power analysis for multiple logistic regression. From what I understand, in G*Power I set Test family = z tests and Statistical test = Logistic regression, but I am not sure what to set R² other X to. Is that the improvement in $R^2$? Section 27.4 of the software manual makes no mention of any variant of $R^2$, and this example does not discuss improvements in $R^2$ either.

asked Jul 20, 2015 at 13:43

$\begingroup$ R2 other X is probably not some "log reg pseudo-$R^2$"; rather, it is the R-squared from regressing the variable of interest on all the other covariables in the model, ignoring the response completely. $\endgroup$

Commented Mar 10, 2019 at 13:05

3 Answers

$\begingroup$

The problem is that there isn't really a $R^2$ for logistic regression. Instead there are many different "pseudo-$R^2$s" that may be similar to the $R^2$ from a linear model in different ways. You can get a list of some at UCLA's statistics help website here.
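As a concrete illustration of one such measure, McFadden's pseudo-$R^2$ is $1-\ell_{\text{model}}/\ell_{\text{null}}$, the proportional reduction in log-likelihood relative to an intercept-only model. A minimal Python sketch with made-up counts (for a single binary predictor the logistic MLE just reproduces the two group proportions, so no iterative fitting is needed):

```python
from math import log

# hypothetical 2x2 data: event counts by a binary predictor
y0, n0 = 30, 100   # events among x = 0
y1, n1 = 60, 100   # events among x = 1

def ll(k, n, p):
    # binomial log-likelihood (dropping the constant binomial coefficient)
    return k * log(p) + (n - k) * log(1 - p)

ll_null = ll(y0 + y1, n0 + n1, (y0 + y1) / (n0 + n1))  # intercept-only model
ll_model = ll(y0, n0, y0 / n0) + ll(y1, n1, y1 / n1)   # model with the predictor
mcfadden = 1 - ll_model / ll_null
print(round(mcfadden, 3))
```

Other pseudo-$R^2$s (Cox-Snell, Nagelkerke, Tjur, ...) would give different numbers for the same fit, which is part of the point of the answer above.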

In addition, the effect (e.g., odds ratio) of the added variable, $x_2$, isn't sufficient to determine your power to detect that effect. It matters how $x_2$ is distributed: The more widely spread the values are, the more powerful your test, even if the odds ratio is held constant. It further matters what the correlation between $x_2$ and $x_1$ is: The more correlated they are, the more data would be required to achieve the same power.

As a result of these facts, the way I try to calculate the power in these more complicated situations is to simulate. In that vein, it may help you to read my answer here: Simulation of logistic regression power analysis - designed experiments.
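To illustrate the simulation approach in outline, here is a hedged Python sketch (the coefficients, the correlation $\rho$ between the predictors, the sample size, and the number of simulations are all made-up values, not taken from the question). It fits both logistic models by Newton-Raphson and uses the likelihood-ratio test for the added predictor:

```python
import numpy as np

def fit_loglik(X, y, iters=30):
    # Newton-Raphson MLE for logistic regression; returns the maximized log-likelihood
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        H = X.T @ (X * (p * (1 - p))[:, None]) + 1e-8 * np.eye(X.shape[1])
        b += np.linalg.solve(H, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ b))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def sim_power(n=300, b0=-0.5, b1=0.4, b2=0.3, rho=0.3, nsim=200, seed=1):
    rng = np.random.default_rng(seed)
    crit = 3.841459  # 95th percentile of chi-square(1), hard-coded to avoid scipy
    rej = 0
    for _ in range(nsim):
        x1 = rng.standard_normal(n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # correlated predictors
        y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(b0 + b1 * x1 + b2 * x2)))).astype(float)
        X0 = np.column_stack([np.ones(n), x1])          # model 1: intercept + x1
        X1 = np.column_stack([X0, x2])                  # model 2: adds x2
        lr = 2 * (fit_loglik(X1, y) - fit_loglik(X0, y))
        rej += lr > crit
    return rej / nsim   # estimated power = rejection rate under the alternative

power = sim_power()
print(power)
```

Increasing `rho` or shrinking `b2` in this sketch lowers the estimated power, matching the points about correlation and effect size above.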

Looking at G*Power's documentation, they use a method based on Hsieh, Bloch, & Larsen (1998). The idea is that you first regress $x_2$ on $x_1$ (or whatever predictor variables went into the first model) using a linear regression. You use the regular $R^2$ for that. (That value should lie in the interval $[0,\ 1]$.) It goes in the R² other X field you are referring to. Then you specify the distribution of $x_2$ in the next couple of fields (X distribution, X parm μ, and X parm σ).
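In other words, the R² other X value comes from an ordinary linear regression of the new predictor on the old ones, never touching the outcome. A small Python sketch with made-up data (the 0.6 coefficient and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.standard_normal(n)
x2 = 0.6 * x1 + rng.standard_normal(n)   # the predictor to be added to the model

# ordinary least squares of x2 on x1 (the outcome y is never used here)
X = np.column_stack([np.ones(n), x1])
beta, *_ = np.linalg.lstsq(X, x2, rcond=None)
resid = x2 - X @ beta
r2_other_x = 1 - resid.var() / x2.var()  # the plain R^2, i.e. the "R² other X" input
print(round(r2_other_x, 3))
```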

answered Oct 13, 2015 at 21:24 gung - Reinstate Monica $\begingroup$

The excellent book Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models has a treatment of power analysis for logistic regression (in section 5.7), with some simple, useful approximate formulas, very possibly the formulas used by G*Power referred to in another answer. If those approximations are not good enough, simulation will probably be needed.

Consider two-sided testing of $H_0\colon \beta_j=0$ (on the log-odds scale) versus $H_1\colon \beta_j=\beta_j^a$ with level $\alpha$ and power $\gamma$. Let $\sigma_{x_j}$ be the standard deviation of the predictor $x_j$, $p$ the marginal prevalence of the outcome, and $\rho_j^2$ the squared multiple correlation of $x_j$ with all the other predictors (this is the R-squared reported by a linear multiple regression with $x_j$ as response and all the other predictors as covariates; it does not involve the response in the logistic regression at all.)

The minimum sample size is then $$ n=\frac{(z_{1-\alpha/2}+z_\gamma)^2}{(\beta_j^a \sigma_{x_j})^2\, p(1-p)(1-\rho_j^2)} $$ where $z_{1-\alpha/2}$ and $z_\gamma$ are the quantiles of the standard normal distribution corresponding to the level and the power. Note the appearance in this formula of the variance inflation factor $\text{VIF}_j=\frac{1}{1-\rho_j^2}$.
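The formula is easy to evaluate directly; a Python sketch using only the standard library (defaults of $\alpha = 0.05$ and power $0.80$ are my assumptions, and the example inputs are illustrative):

```python
from math import ceil
from statistics import NormalDist

def min_n(beta_a, sigma_x=1.0, p=0.5, rho2=0.0, alpha=0.05, power=0.80):
    # minimum n for two-sided testing of beta_j = 0 vs beta_j = beta_a
    z = NormalDist().inv_cdf
    vif = 1.0 / (1.0 - rho2)   # variance inflation factor
    return ceil((z(1 - alpha/2) + z(power))**2 * vif
                / ((beta_a * sigma_x)**2 * p * (1 - p)))

print(min_n(beta_a=0.2, sigma_x=1.0, p=0.5, rho2=0.5))  # -> 1570
```

Note how $\rho_j^2 = 0.5$ doubles the required sample size relative to uncorrelated predictors, via the VIF.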

A graph showing minimum sample size as a function of the alternative coefficient $\beta_j^a$:

[figure: minimum sample size versus $\beta_j^a$]

For completeness some related formulas from the same source:

If the sample size $n$ is fixed, then the power is $$ \gamma=1-\Phi\left(z_{1-\alpha/2}-|\beta_j^a| \sigma_{x_j}\sqrt{n p(1-p)(1-\rho_j^2)}\right)$$ where $\Phi$ is the standard normal cumulative distribution function. The minimum detectable effect (on the log-odds scale) is $$ \pm \beta_j^a = \frac{z_{1-\alpha/2}+z_\gamma}{\sigma_{x_j}\sqrt{n p(1-p)(1-\rho_j^2)}} $$ The reference given for these approximate formulas is "A simple method of sample size calculation for linear and logistic regression", which in turn refers for much of the theory to "Sample size for logistic regression with small response probability", which bases its results on approximations to the Fisher information matrix, so this is really based on normal approximations. It is known that normal approximations can be bad for logistic regression, so the results from these formulas should probably be checked by simulation.
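These two formulas can be checked numerically the same way; a standard-library Python sketch (the inputs are illustrative, chosen so that the results round-trip with $n = 1570$ from the sample-size example):

```python
from statistics import NormalDist

def power_given_n(n, beta_a, sigma_x=1.0, p=0.5, rho2=0.0, alpha=0.05):
    # power = 1 - Phi(z_{1-alpha/2} - |beta_a| sigma_x sqrt(n p(1-p)(1-rho2)))
    nd = NormalDist()
    s = abs(beta_a) * sigma_x * (n * p * (1 - p) * (1 - rho2)) ** 0.5
    return 1 - nd.cdf(nd.inv_cdf(1 - alpha/2) - s)

def mde(n, sigma_x=1.0, p=0.5, rho2=0.0, alpha=0.05, power=0.80):
    # minimum detectable effect on the log-odds scale
    z = NormalDist().inv_cdf
    return (z(1 - alpha/2) + z(power)) / (sigma_x * (n * p * (1 - p) * (1 - rho2)) ** 0.5)

# round-trip: n = 1570 should give power ~0.80 and a minimum detectable effect ~0.2
print(round(power_given_n(1570, 0.2, rho2=0.5), 3), round(mde(1570, rho2=0.5), 3))  # -> 0.8 0.2
```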

An R function min_n implementing the sample size formula (the function body here is a reconstruction, since the original definition was garbled; level $0.05$ and power $0.80$ are assumed as defaults):

min_n <- function(beta_a, sigma_x = 1, p = 0.5, R2 = 0, alpha = 0.05, power = 0.8)
    ceiling((qnorm(1 - alpha/2) + qnorm(power))^2 / ((beta_a * sigma_x)^2 * p * (1 - p) * (1 - R2)))
min_n(beta_a = 0.2, sigma_x = 1, p = 0.5, R2 = 0.5)
[1] 1570