# week4

**Repository Path**: sika0819/week4

## Basic Information

- **Project Name**: week4
- **Description**: Week 4 homework
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2019-04-27
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

1. For a continuous feature, which function can be used to visualize its distribution? (Give the one you use most often.) Illustrate with the output of your code.

Answer: `distplot` can be used.

```python
sns.distplot(data["AGE"], bins=30, kde=True)
plt.xlabel("Age", fontsize=12)
plt.show()
```

Result: ![distplot](distplot.png)

2. For two continuous features, which function shows the correlation between them? Illustrate with the output of your code.

Answer: The correlation between two continuous features can be inspected with a correlation matrix: first compute the pairwise correlations with the DataFrame `corr()` method, then visualize the matrix with seaborn's `heatmap()`.

```python
data_corr = data[numberic_col].corr()
sns.heatmap(data_corr, annot=True)
```

Result: ![heatmap](heatmap.png)

3. If strong correlations are found between features, what measures should be taken when choosing a linear regression model?

Answer: If features are highly correlated, consider PCA dimensionality reduction (at the feature level) or adding a regularization term (at the model level).

4. When using a regularized model or the stochastic gradient descent optimizer, continuous input features need to be rescaled. The course code gave results with standardization (StandardScaler); change it to min-max scaling (MinMaxScaler) (10 points) and retrain the ordinary least squares, ridge, and Lasso models (30 points).

Different features have different value ranges, so rescaling is needed: MinMaxScaler preserves sparsity after scaling, while StandardScaler standardizes the data by subtracting the mean (centering).

```python
from sklearn.preprocessing import MinMaxScaler

ss_X = MinMaxScaler()
ss_y = MinMaxScaler()
ss_log_y = MinMaxScaler()

X = ss_X.fit_transform(X)
y = ss_y.fit_transform(y.reshape(-1, 1))
log_y = ss_log_y.fit_transform(log_y.values.reshape(-1, 1))  # use the dedicated log-target scaler
```

Ordinary least squares:

```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
y_test_pred_lr = lr.predict(X_test)
y_train_pred_lr = lr.predict(X_train)

# Inspect the learned coefficient of each feature
fs = pd.DataFrame({"columns": list(feat_names), "coef": list(lr.coef_.T)})
fs.sort_values(by=["coef"], ascending=False)

print('The r2 score of LinearRegression on test is', r2_score(y_test, y_test_pred_lr))
print('The r2 score of LinearRegression on train is', r2_score(y_train, y_train_pred_lr))
```

Result:

```
The r2 score of LinearRegression on test is 0.6939789810509471
The r2 score of LinearRegression on train is 0.7549146436868177
```

Ridge regression tuning:

```python
from sklearn.linear_model import RidgeCV

n_alphas = 20
alphas = np.logspace(-5, 2, n_alphas)
ridge = RidgeCV(alphas=alphas, store_cv_values=True)
ridge.fit(X_train, y_train)
y_test_pred_ridge = ridge.predict(X_test)
y_train_pred_ridge = ridge.predict(X_train)

print('The r2 score of RidgeCV on test is', r2_score(y_test, y_test_pred_ridge))
print('The r2 score of RidgeCV on train is', r2_score(y_train, y_train_pred_ridge))
```

Result:

```
The r2 score of RidgeCV on test is 0.6993594964254004
The r2 score of RidgeCV on train is 0.7545292954339179
```

Lasso tuning:

```python
from sklearn.linear_model import LassoCV

n_alphas = 20
alphas = np.logspace(-5, 2, n_alphas)
lasso = LassoCV(alphas=alphas)
lasso.fit(X_train, y_train)
y_test_pred_lasso = lasso.predict(X_test)
y_train_pred_lasso = lasso.predict(X_train)

print('The r2 score of LassoCV on test is', r2_score(y_test, y_test_pred_lasso))
print('The r2 score of LassoCV on train is', r2_score(y_train, y_train_pred_lasso))
```

Result:

```
The r2 score of LassoCV on test is 0.6941699179894809
The r2 score of LassoCV on train is 0.7549117457442763
```

The tuned models do somewhat better than plain least squares, though the difference is small. Ridge performed best on the test set; the earlier data exploration also showed some fairly strong (though not extreme) correlations between features, and the data are not sparse.

5. The code shows the hyperparameter (alpha) tuning process for ridge regression (RidgeCV) and Lasso (LassoCV). Combining the two best tuned models with the ordinary least squares result, in which situations should ridge, Lasso, and ordinary least squares each be used?

Answer: When the features are not strongly correlated and the linear form of the model is known, ordinary least squares is sufficient. When the model is unknown, use Lasso or ridge: if the data are not sparse and all features are relevant to the target, use ridge; if the data are sparse and some features may not be relevant to the target, use Lasso.
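The claim in question 4 — that MinMaxScaler preserves sparsity while StandardScaler centers the data — can be sketched on a toy feature (the array below is invented for illustration, not from the course dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A mostly-zero, non-negative feature, as is common for sparse data
X = np.array([[0.0], [0.0], [0.0], [2.0], [10.0]])

mm = MinMaxScaler().fit_transform(X)    # maps to [0, 1]; zeros stay zero (feature minimum is 0)
ss = StandardScaler().fit_transform(X)  # subtracts the mean, so every zero becomes negative

print(mm.ravel())
print(ss.ravel())
```

Note the caveat: MinMaxScaler keeps zeros at zero only because the feature's minimum is 0; a feature with a negative minimum would shift its zeros too.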
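The guidance in question 5 can also be illustrated with a small synthetic sketch (the data, coefficients, and alpha values here are made up for illustration): ridge spreads weight roughly evenly over two nearly duplicate features, while Lasso drives the coefficient of an irrelevant feature to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
n = 200
x1 = rng.randn(n)
x2 = x1 + 0.01 * rng.randn(n)   # near-duplicate of x1: strong correlation
x3 = rng.randn(n)               # irrelevant feature, unrelated to the target
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 0.1 * rng.randn(n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)

# Ridge splits the weight between the correlated pair; Lasso zeroes out x3
print("Ridge:", ridge.coef_)
print("Lasso:", lasso.coef_)
```

Here the L2 penalty keeps both correlated coefficients small but nonzero, while the L1 penalty produces an exact zero for the feature unrelated to the target, matching the "sparse data / possibly irrelevant features → Lasso" rule above.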