In this project, I compared several regression models to predict MLB player salaries using the Hitters dataset. The goal is to understand how different modeling choices affect performance, using cross-validation and mean squared error (MSE) to compare the models.
I focus on how standardization and regularization, specifically Ridge, Lasso, and Elastic Net, help control overfitting, especially when predictors are highly correlated. I also test whether adding interaction terms leads to meaningful improvements in model performance.
Overall, this lab emphasizes evaluating models based on generalization rather than complexity alone.
Key takeaways:
- Cross-validation provides a more reliable way to compare models than training error alone.
- Regularization improves model stability when predictors are correlated.
- Interaction terms can improve performance, but only when they are well motivated and carefully evaluated.
Import Data & Clean Data
import pandas as pd

hitters = pd.read_csv("C:/Users/spink/OneDrive/Desktop/Machine Learning/Data/Hitters (2).csv")
hitters = hitters.dropna()
hitters.head()
   AtBat  Hits  HmRun  Runs  RBI  Walks  Years  CAtBat  CHits  CHmRun  CRuns  CRBI  CWalks League Division  PutOuts  Assists  Errors  Salary NewLeague
1    315    81      7    24   38     39     14    3449    835      69    321   414     375      N        W      632       43      10   475.0         N
2    479   130     18    66   72     76      3    1624    457      63    224   266     263      A        W      880       82      14   480.0         A
3    496   141     20    65   78     37     11    5628   1575     225    828   838     354      N        E      200       11       3   500.0         N
4    321    87     10    39   42     30      2     396    101      12     48    46      33      N        E      805       40       4    91.5         N
5    594   169      4    74   51     35     11    4408   1133      19    501   336     194      A        W      282      421      25   750.0         A
Part 1: Different Model Specifications
A. Regression without regularization
Create a pipeline that includes all the columns as predictors for Salary, and performs ordinary linear regression
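A pipeline along these lines can be sketched as follows. This is a minimal sketch: the step names (`standardize`, `dummify`) are inferred from the coefficient names shown later in the lab, the exact column lists are assumptions, and a tiny synthetic frame stands in for the full Hitters data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the Hitters data (column names are real; values illustrative)
demo = pd.DataFrame({
    "AtBat": [315, 479, 496, 321, 594, 185],
    "Hits": [81, 130, 141, 87, 169, 37],
    "League": ["N", "A", "N", "N", "A", "N"],
    "Division": ["W", "W", "E", "E", "W", "E"],
    "Salary": [475.0, 480.0, 500.0, 91.5, 750.0, 91.5],
})
X = demo.drop(columns="Salary")
y = demo["Salary"]

numeric = ["AtBat", "Hits"]          # in the full data, all numeric columns go here
categorical = ["League", "Division"]  # plus NewLeague in the full data

# Standardize numeric predictors and dummy-encode categoricals, then fit OLS
preprocess = ColumnTransformer([
    ("standardize", StandardScaler(), numeric),
    ("dummify", OneHotEncoder(), categorical),
])
pipeline_1 = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipeline_1.fit(X, y)
print(pipeline_1.named_steps["model"].coef_)
```

Standardizing inside the pipeline matters because it keeps the scaler from seeing held-out folds during cross-validation, and it puts the coefficients on a common per-SD scale.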
Holding all else constant, a player with at-bats one standard deviation above the mean is predicted to earn about $291,000 less. A one-SD increase in hits corresponds to about $338,000 more pay, and a one-SD increase in home runs to about $38,000 more, all else equal.
Use cross-validation to estimate the MSE you would expect if you used this pipeline to predict 1989 salaries.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline_1, X, y, cv=5, scoring='neg_mean_squared_error')
# 1989 salaries would be new data, so the cross-validated MSE is a decent predictor
mse = -scores.mean()
mse
np.float64(121136.31031816883)
B. Ridge regression
Create a pipeline that includes all the columns as predictors for Salary, and performs ordinary ridge regression
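The only change from the OLS pipeline is swapping `LinearRegression` for `Ridge` and tuning the penalty strength. A minimal sketch on synthetic data (the alpha grid and pipeline names are assumptions, not the lab's actual choices):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the preprocessed Hitters data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

ridge_pipe = Pipeline([("standardize", StandardScaler()), ("model", Ridge())])

# Tune alpha by cross-validated MSE, mirroring how the lab compares models
search = GridSearchCV(
    ridge_pipe,
    {"model__alpha": [0.01, 0.1, 1, 10, 100]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```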
Players with more putouts tend to have higher salaries. Players who get more hits tend to earn higher salaries. Players with more career runs batted in have higher salaries. For everything just mentioned, other factors have to be controlled for.
Report the MSE you would expect if you used this pipeline to predict 1989 salaries.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(top_ridge_pipeline, X, y, cv=5, scoring='neg_mean_squared_error')
# 1989 salaries would be new data, so the cross-validated MSE is a decent predictor
mse = -scores.mean()
mse
np.float64(120716.43558937623)
C. Lasso Regression
Create a pipeline that includes all the columns as predictors for Salary, and performs LASSO regression
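Unlike Ridge, the L1 penalty can drive coefficients exactly to zero, which is why so many entries in the coefficient table below are 0. A small sketch on synthetic data with correlated predictors (the alpha value is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 200
base = rng.normal(size=n)
# Four near-duplicate copies of one signal, plus four pure-noise predictors
X = np.column_stack(
    [base + rng.normal(scale=0.1, size=n) for _ in range(4)]
    + [rng.normal(size=n) for _ in range(4)]
)
y = 2.0 * base + rng.normal(scale=0.5, size=n)

Xs = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(Xs, y)
# The L1 penalty typically zeroes out noise predictors and some of the duplicates
print(lasso.coef_)
```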
# wanted to check something actually changed, since I got the same top coefficients
lasso_coefficients
    Names                  Coefficient    Abs_coef
0   standardize__AtBat    -0.000000e+00   0.567370
1   standardize__Hits      8.874162e+01  49.612386
2   standardize__HmRun     0.000000e+00   1.464159
3   standardize__Runs      0.000000e+00  29.343263
4   standardize__RBI       0.000000e+00  22.958015
5   standardize__Walks     4.990279e+01  41.384617
6   standardize__Years    -0.000000e+00   2.708306
7   standardize__CAtBat    0.000000e+00  24.705844
8   standardize__CHits     0.000000e+00  44.534276
9   standardize__CHmRun    0.000000e+00  38.685330
10  standardize__CRuns     7.222808e+01  45.507606
11  standardize__CRBI      1.340315e+02  47.145556
12  standardize__CWalks   -0.000000e+00   4.036371
13  standardize__PutOuts   6.673703e+01  56.881522
14  standardize__Assists   0.000000e+00   7.457239
15  standardize__Errors   -4.158285e+00  13.382390
16  dummify__League_A     -0.000000e+00  11.051842
17  dummify__League_N      0.000000e+00  11.051842
18  dummify__Division_E    9.541375e+01  38.023222
19  dummify__Division_W   -3.376379e-12  38.023222
20  dummify__NewLeague_A  -0.000000e+00   4.091590
21  dummify__NewLeague_N   0.000000e+00   4.091590
Same conclusions as found from Ridge Regression.
Report the MSE you would expect if you used this pipeline to predict 1989 salaries.
from sklearn.model_selection import cross_val_score

scores_lasso = cross_val_score(top_lasso_pipeline, X, y, cv=5, scoring='neg_mean_squared_error')
# 1989 salaries would be new data, so the cross-validated MSE is a decent predictor
mse = -scores_lasso.mean()
mse
np.float64(121828.10222019239)
D. Elastic Net
Create a pipeline that includes all the columns as predictors for Salary, and performs Elastic Net regression
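Elastic Net mixes the two penalties, so both the overall strength (`alpha`) and the L1/L2 mix (`l1_ratio`) need tuning. A minimal sketch using `ElasticNetCV` on synthetic data (the grids below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 6))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=150)  # a highly correlated pair
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=150)

Xs = StandardScaler().fit_transform(X)

# Cross-validate jointly over the penalty strength and the L1/L2 mix
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], alphas=np.logspace(-3, 1, 20), cv=5)
enet.fit(Xs, y)
print(enet.l1_ratio_, enet.alpha_)
```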
Report the MSE you would expect if you used this pipeline to predict 1989 salaries.
from sklearn.model_selection import cross_val_score

scores_elastic = cross_val_score(top_elastic_pipeline, X, y, cv=5, scoring='neg_mean_squared_error')
mse = -scores_elastic.mean()
mse
np.float64(121374.33335443735)
Part II. Variable Selection
I think PutOuts is the most important variable because it is the highest ranked in all three models. The five numeric variables I consider most important are PutOuts, CRBI, Hits, CRuns, and CHits, because they remained the highest ranked even as others shifted under the different penalties. I think Division (Division_W and Division_E carry the same magnitude) is the most important categorical variable because it consistently had the highest coefficient.
Top 5 coefficients for OLS:
standardize__AtBat   -291.094556
standardize__Hits     337.830479
standardize__HmRun     37.853837
standardize__Runs     -60.572479
standardize__RBI      -26.994984
C:\Users\spink\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:695: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 9.153e+05, tolerance: 4.708e+03
np.float64(120189.15162209612)
For LASSO, the interaction feature set has the lowest MSE.
C:\Users\spink\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:695: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.175e+04, tolerance: 4.708e+03
np.float64(117621.08422157416)
The interaction model again has the lowest MSE.
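The interaction feature set used above can be built by multiplying the Division dummy into each of the top numeric predictors. A hypothetical sketch (column names come from the Hitters data; the values and helper column names are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    "PutOuts": [632, 880, 200, 805, 282],
    "CRBI": [414, 266, 838, 46, 336],
    "Division": ["W", "W", "E", "E", "W"],
})

# 0/1 dummy for Division, then multiply it into each top numeric column
demo["Division_W"] = (demo["Division"] == "W").astype(int)
for col in ["PutOuts", "CRBI"]:
    demo[f"{col}_x_DivW"] = demo[col] * demo["Division_W"]

print(demo[["PutOuts_x_DivW", "CRBI_x_DivW"]])
```

Each interaction column is nonzero only for West-division players, which lets the model fit a different slope per division.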
Part III. Discussion
A. Compare your Ridge models with your ordinary regression models. How did your coefficients compare? Why does this make sense?
The Ridge regression coefficients were smaller than the OLS coefficients, which makes sense because Ridge applies a shrinkage penalty while OLS does not.
B. Compare your LASSO model in I with your three LASSO models in II. Did you get the same Lamda results? Why does this make sense? Did you get the same MSEs? Why does this make sense?
No, we didn’t get the same lambdas for the LASSO model in Part I and the models in Part II, because we tuned over different feature sets with different numbers of variables. We also did not get the same MSEs, because the models used different variables, and different numbers of variables, to predict salary, both between Parts I and II and across the feature sets within Part II.
C. Compare your MSEs for the Elastic Net models with those for the Ridge and LASSO models. Why does it make sense that Elastic Net always “wins”?
Elastic Net balances Ridge and LASSO and gets the best of both worlds: it shrinks the coefficients smoothly while also driving the unimportant ones toward zero, without committing fully to either penalty. So it can match or beat Ridge and LASSO, and it beats OLS because the penalty forces new variables to justify themselves, so to speak.
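For reference, this balance can be written out explicitly. In scikit-learn's parameterization (with $\rho$ = `l1_ratio`), the Elastic Net objective is:

```latex
\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  \;+\; \alpha \Big( \rho\,\lVert \beta \rVert_1
  \;+\; \frac{1-\rho}{2}\,\lVert \beta \rVert_2^2 \Big)
```

Setting $\rho = 1$ recovers LASSO and $\rho = 0$ recovers Ridge, which is why tuning $\rho$ lets Elastic Net do at least as well as either.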
Part IV: Final Model
Fit your final best pipeline on the full dataset, and summarize your results in a few short sentences and a plot.
from plotnine import ggplot, aes, geom_point, geom_abline

plot_data = pd.DataFrame({
    "salary": y,
    "ols int predicted salary": ols.predict(XInt),
})
(ggplot(plot_data, aes(x="ols int predicted salary", y="salary"))
 + geom_point()
 + geom_abline(intercept=0, slope=1, color="blue"))
The OLS feature set that includes interactions between the most important categorical variable and the five most important numeric variables has the lowest MSE, making it the most accurate model. The plot of predicted versus actual salary shows that the OLS interaction model is reasonably accurate.