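8.1. Introduction
Evaluating and testing a regression model is a key step in any machine learning workflow. This lab covers the common evaluation metrics for regression models, the relevant statistical tests, and ways to diagnose whether a model's assumptions hold.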
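8.2. Key Topics
- Evaluation metrics for regression models
- Residual analysis
- Statistical hypothesis testing
- Model diagnostics
- Cross-validation
- Model comparison and selection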
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Note: load_boston was removed in scikit-learn 1.2, so this import requires scikit-learn < 1.2
from sklearn.datasets import load_boston, make_regression
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from scipy import stats
from scipy.stats import jarque_bera, shapiro, normaltest
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, acorr_breusch_godfrey
import warnings
warnings.filterwarnings('ignore')
# Use a font that can render Chinese characters in plot labels
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
print("Libraries imported successfully!")
8.3. Data Preparation
We use the Boston housing dataset to demonstrate how to evaluate and test regression models:
# Load the data (load_boston is only available in scikit-learn < 1.2)
boston = load_boston()
X = boston.data
y = boston.target
print(f"Dataset shape: {X.shape}")
print(f"Feature names: {boston.feature_names}")
# Build a DataFrame
df = pd.DataFrame(X, columns=boston.feature_names)
df['PRICE'] = y
# Summary statistics
print("\nDataset summary:")
print(df.describe())
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features: fit on the training set only, then apply to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"\nTraining set size: {X_train_scaled.shape}")
print(f"Test set size: {X_test_scaled.shape}")
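The standardization step above fits the scaler on the training split and reuses the learned statistics on the test split. The following is a minimal sketch of that pattern, assuming synthetic data from `make_regression` as a stand-in for the housing data; the names `X_demo`, `scaler_demo`, and the sample sizes are illustrative and not part of the original lab.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; this is not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # mean and std are learned from the training split only
X_te_scaled = scaler_demo.transform(X_te)      # the same training statistics are reused here

# Training columns end up with mean 0 and std 1; test columns are only close to that
print("train means:", X_tr_scaled.mean(axis=0).round(3))
print("train stds: ", X_tr_scaled.std(axis=0).round(3))
print("test means: ", X_te_scaled.mean(axis=0).round(3))
print("test stds:  ", X_te_scaled.std(axis=0).round(3))
```

Fitting the scaler on the full dataset would leak test-set information into preprocessing, which is why the test split only goes through `transform`.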
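8.4. Evaluation Metrics for Regression Models
8.4.1. Basic Metrics
- Mean squared error (MSE): $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
- Root mean squared error (RMSE): $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
- Mean absolute error (MAE): $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$
- Coefficient of determination (R²): $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$
8.4.2. Adjusted Coefficient of Determination
- Adjusted R²: $R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$
where $n$ is the number of samples and $p$ is the number of features.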
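To make these definitions concrete, here is a small sketch that computes each metric by hand with NumPy and compares the results with the scikit-learn functions. The arrays `y_true` and `y_pred` and the feature count `p = 2` are made-up illustrative values, not results from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Toy values chosen for illustration only (hypothetical, not from the dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5, 11.0, 14.0])
n, p = len(y_true), 2  # p = 2 is an assumed number of features

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)                    # mean of squared residuals
rmse = np.sqrt(mse)                              # same units as the target
mae = np.mean(np.abs(residuals))                 # mean of absolute residuals
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)    # penalizes the feature count

print(f"MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f} R2_adj={r2_adj:.4f}")
print("matches sklearn:",
      np.isclose(mse, mean_squared_error(y_true, y_pred)),
      np.isclose(mae, mean_absolute_error(y_true, y_pred)),
      np.isclose(r2, r2_score(y_true, y_pred)))
```

Because $(n - 1)/(n - p - 1) > 1$ whenever $p \ge 1$, the adjusted value is lower than the plain R² as long as R² is below 1, and the gap widens as features are added without improving the fit.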
# Define the regression models to compare
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'LASSO Regression': Lasso(alpha=0.1)
}
# Compute the evaluation metrics
def calculate_metrics(y_true, y_pred, n_features):
    """Compute MSE, RMSE, MAE, R² and adjusted R²."""
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    # Adjusted R²
    n = len(y_true)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2,
        'R²_adj': r2_adj
    }
# Train and evaluate each model
results = {}
for name, model in models.items():
    # Fit the model
    model.fit(X_train_scaled, y_train)
    # Predict on both splits
    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)
    # Compute the metrics
    train_metrics = calculate_metrics(y_train, y_pred_train, X_train_scaled.shape[1])
    test_metrics = calculate_metrics(y_test, y_pred_test, X_train_scaled.shape[1])
    results[name] = {
        'model': model,
        'train_metrics': train_metrics,
        'test_metrics': test_metrics,
        'y_pred_train': y_pred_train,
        'y_pred_test': y_pred_test
    }
    print(f"{name} metrics:")
    print(f"  Train - MSE: {train_metrics['MSE']:.3f}, RMSE: {train_metrics['RMSE']:.3f}, R²: {train_metrics['R²']:.3f}")
    print(f"  Test  - MSE: {test_metrics['MSE']:.3f}, RMSE: {test_metrics['RMSE']:.3f}, R²: {test_metrics['R²']:.3f}")
    print()
# Build a comparison table of test-set metrics
metrics_df = pd.DataFrame({
    name: [
        results[name]['test_metrics']['MSE'],
        results[name]['test_metrics']['RMSE'],
        results[name]['test_metrics']['MAE'],
        results[name]['test_metrics']['R²'],
        results[name]['test_metrics']['R²_adj']
    ]
    for name in results.keys()
}, index=['MSE', 'RMSE', 'MAE', 'R²', 'R²_adj'])
print("Model metric comparison:")
print(metrics_df.round(3))
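A comparison table like the one above is often easier to read with one row per model, sorted by the metric of interest. The sketch below shows one possible way to do that, assuming synthetic data from `make_regression` instead of the housing data; `candidates`, `ranking`, and `best_name` are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; this is not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "LASSO Regression": Lasso(alpha=0.1),
}

rows = {}
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mse = mean_squared_error(y_te, pred)
    rows[name] = {"MSE": mse, "RMSE": np.sqrt(mse), "R2": r2_score(y_te, pred)}

# One row per model, sorted so the lowest test RMSE comes first
ranking = pd.DataFrame.from_dict(rows, orient="index").sort_values("RMSE")
best_name = ranking.index[0]

print(ranking.round(3))
print("Lowest test RMSE:", best_name)
```

Ranking on a single test split is a simplification: small differences between models may not be meaningful, and a test set that is used for selection no longer gives an unbiased estimate of generalization error.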
8.5. Residual Analysis
Residual analysis is a core diagnostic for regression models. It checks whether the model's assumptions hold: