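So, in the stroke prediction dataset, I created dummy variables for all the categorical variables, i.e. gender_Male and gender_Female, smoking_status_smokes and smoking_status_Unknown, etc. Now, to check for multicollinearity among all the variables (numeric and dummy), I applied the variance inflation factor: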
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# All predictors, i.e. every column except the target 'stroke'
X = new_df.drop(columns='stroke')

vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
# One VIF per predictor column, computed against all the other predictors
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_data
The output I get is as follows:
                           feature       VIF
0                              age  2.836394
1                     hypertension  1.111484
2                    heart_disease  1.113943
3                avg_glucose_level  1.107552
4                              bmi  1.342729
5                    gender_Female       inf
6                      gender_Male       inf
7                  ever_married_No       inf
8                 ever_married_Yes       inf
9               work_type_Govt_job       inf
10          work_type_Never_worked       inf
11               work_type_Private       inf
12         work_type_Self-employed       inf
13              work_type_children       inf
14            Residence_type_Rural       inf
15            Residence_type_Urban       inf
16  smoking_status_formerly smoked       inf
17     smoking_status_never smoked       inf
18           smoking_status_smokes       inf
Can someone explain why the VIF of the dummy variables is infinite? Is there a better way to check for multicollinearity? Thanks.