1. Get the data and define the problem
Dataset description: http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
Download location: http://archive.ics.uci.edu/ml/machine-learning-databases/00294/
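The dataset comes from a combined cycle power plant. It has 9568 samples with 5 columns each: AT (ambient temperature), V (exhaust vacuum), AP (ambient pressure), RH (relative humidity) and PE (net electrical energy output). We don't need to dwell on the exact meaning of each column.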
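Our problem is to find a linear relationship in which PE is the sample output and AT, V, AP and RH are the four sample features. The goal of the learning step is a linear regression model of the form:
PE = θ_0 + θ_1·AT + θ_2·V + θ_3·AP + θ_4·RH
What has to be learned is the five parameters θ_0, θ_1, θ_2, θ_3 and θ_4.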
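To make the model concrete, here is a minimal sketch of how a prediction is computed once the five parameters are known, with theta_0 as the intercept and theta_1 to theta_4 as the weights of AT, V, AP and RH. The function name `predict_pe` and all the numbers below are made up purely for illustration; they are not fitted values.

```python
def predict_pe(theta, at, v, ap, rh):
    """Evaluate PE = theta_0 + theta_1*AT + theta_2*V + theta_3*AP + theta_4*RH."""
    theta_0, theta_1, theta_2, theta_3, theta_4 = theta
    return theta_0 + theta_1 * at + theta_2 * v + theta_3 * ap + theta_4 * rh


# Hypothetical parameters and one hypothetical sample, chosen only for illustration.
theta = (10.0, 2.0, 0.5, 0.0, -1.0)
pe = predict_pe(theta, at=3.0, v=4.0, ap=100.0, rh=5.0)
print(pe)  # 10 + 6 + 2 + 0 - 5 = 13.0
```

Training the model amounts to choosing `theta` so that these predictions are as close as possible to the measured PE values.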
2. Prepare the data
3. Read the data with pandas
Import the dependencies
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
Read the data
data = pd.read_csv("./data/CCPP/CCPP.csv")
data.head()
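4. Prepare the data for the algorithm
# Feature matrix: the four input columns
X = data[["AT", "V", "AP", "RH"]]
X
# Target: the electrical output column
Y = data[["PE"]]
Y.head()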
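5. Split into training and test sets
# sklearn.cross_validation was removed from scikit-learn; train_test_split lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1)
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)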
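When neither `test_size` nor `train_size` is passed, `train_test_split` holds out 25% of the samples for testing. The sketch below mimics that idea in plain Python to show where the set sizes come from. The helper `simple_train_test_split` is hypothetical, and its shuffle order will not match the one scikit-learn produces for `random_state=1`; only the sizes are comparable.

```python
import math
import random


def simple_train_test_split(n_samples, test_size=0.25, seed=1):
    """Return (train_indices, test_indices) after shuffling 0..n_samples-1."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = math.ceil(n_samples * test_size)  # size of the held-out part
    return indices[n_test:], indices[:n_test]


# 9568 is the number of samples in the CCPP dataset.
train_idx, test_idx = simple_train_test_split(9568)
print(len(train_idx), len(test_idx))  # 7176 2392
```

Every sample lands in exactly one of the two sets, so the model is always evaluated on data it has not seen during fitting.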
6. Run the scikit-learn linear model
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, Y_train)
print(linreg.intercept_)
print(linreg.coef_)
# Output:
# [460.05727267]
# [[-1.96865472 -0.2392946 0.0568509 -0.15861467]]
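`LinearRegression` fits the parameters by ordinary least squares, that is, by minimizing the sum of squared differences between predictions and targets. As a rough illustration of what that means, the sketch below solves the single-feature case in closed form with plain Python. The helper `fit_simple_ols` and the toy data are assumptions made for this example; the four-feature fit above solves the same kind of problem with more columns.

```python
def fit_simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data lying exactly on y = 1 + 2x, made up for illustration.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_simple_ols(xs, ys)
print(intercept, slope)  # 1.0 2.0
```

The two returned values play the same roles as `linreg.intercept_` and `linreg.coef_` in the fitted model.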
7. Evaluate the model
We need to measure how good the model is. For linear regression, the usual choice is the mean squared error (MSE) or the root mean squared error (RMSE) on the test set.
# Predict on the test set
Y_pred = linreg.predict(X_test)
from sklearn import metrics
# Compute the MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(Y_test, Y_pred))
# Compute the RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
# Output:
# MSE: 20.83719154722035
# RMSE: 4.564777272465805
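For reference, both metrics are simple to compute by hand: MSE is the average of the squared errors and RMSE is its square root, which brings the value back to the units of the target. The sketch below uses made-up numbers and a hypothetical helper name, `mse_by_hand`, so it is not tied to the power plant data.

```python
import math


def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions; every prediction is off by exactly 2.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [5.0, 3.0, 9.0, 7.0]
mse = mse_by_hand(y_true, y_pred)
rmse = math.sqrt(mse)
print(mse, rmse)  # 4.0 2.0
```

Because every error here has magnitude 2, the RMSE comes out as 2, which is why RMSE is often easier to interpret than MSE.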
8. Cross-validation
from sklearn.model_selection import cross_val_predict
# 10-fold cross-validation: each sample is predicted by a model that did not see it
predicted = cross_val_predict(linreg, X, Y, cv=10)
# Compute the MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(Y, predicted))
# Compute the RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(Y, predicted)))
# Output:
# MSE: 20.793672509857537
# RMSE: 4.560007950635343
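The idea behind `cross_val_predict` is that the data is cut into k folds and every sample is predicted by a model trained only on the other folds. The sketch below illustrates this with plain Python, using contiguous unshuffled folds and a deliberately trivial "model" that predicts the mean of its training targets. Both helper functions and the toy data are assumptions for illustration, not scikit-learn internals.

```python
def k_fold_indices(n_samples, k):
    """Split 0..n_samples-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first folds absorb the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_mean_predictions(ys, k):
    """Predict each target with the mean of the targets in the other folds."""
    predictions = [None] * len(ys)
    for fold in k_fold_indices(len(ys), k):
        held_out = set(fold)
        train = [y for i, y in enumerate(ys) if i not in held_out]
        fold_prediction = sum(train) / len(train)  # "fit" on the training folds
        for i in fold:
            predictions[i] = fold_prediction  # "predict" the held-out fold
    return predictions


# Toy targets, made up for illustration.
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
oof = out_of_fold_mean_predictions(ys, k=3)
print(oof)  # [4.5, 4.5, 3.5, 3.5, 2.5, 2.5]
```

Since every sample gets exactly one out-of-fold prediction, the resulting list can be scored against the full target column, which is what the MSE and RMSE lines above do.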
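9. Plot the results
fig, ax = plt.subplots()
# Measured PE values as a 1-D Series
y = Y["PE"]
ax.scatter(y, predicted)
# Dashed diagonal: points on this line are predicted perfectly
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()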




