In this post we will introduce the multivariate adaptive regression splines (MARS) model using Python. It is a regression model that can be seen as a non-parametric extension of the standard linear model.
The multivariate adaptive regression splines model
MARS builds a model of the form

\hat{f}(x) = \sum_{i=1}^{k} c_i B_i(x)

where x is a sample vector, B_i is a function from a set of basis functions (later called terms) and c_i is the associated coefficient.
The basis functions are one of the following:
- the constant 1, the so-called intercept, to reduce bias,
- a hinge function max(0, x - c) or max(0, c - x) (also called rectifier functions), where c is a constant knot, to model non-linearities,
- a product of two or more hinge functions, to automatically model interactions between features (see the small sketch after this list).
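To make these building blocks concrete, here is a tiny numpy sketch of a hand-made MARS-style prediction. The knots and coefficients are made up purely for illustration; in practice the forward pass chooses them from the data.

import numpy as np

def hinge_pos(x, c):
    """Hinge function max(0, x - c) with knot c."""
    return np.maximum(0.0, x - c)

def hinge_neg(x, c):
    """Mirrored hinge function max(0, c - x) with knot c."""
    return np.maximum(0.0, c - x)

# Illustrative only: two features and made-up knots/coefficients
x1, x2 = 3.0, 7.0
prediction = (
    10.0                                             # intercept term
    + 2.5 * hinge_pos(x1, 2.0)                       # hinge term in x1
    + 1.2 * hinge_neg(x2, 5.0) * hinge_pos(x1, 1.0)  # product term: interaction of x1 and x2
)
print(prediction)  # 10.0 + 2.5 * 1.0 + 1.2 * 0.0 * 2.0 = 12.5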
How do we learn?
The MARS algorithm learns in two steps. First, we overgenerate a set of basis functions in the so-called "forward pass". The forward pass searches for terms that locally minimize the squared error loss on the training set; basis functions are added greedily, one after another. In the second step, the "pruning pass", we remove unnecessary basis functions. The pruning pass uses generalized cross validation (GCV) to compare the performance of model subsets and choose the best one: lower GCV values are better. The GCV is a form of regularization: it trades off goodness-of-fit against model complexity.
The formula for the GCV is

GCV = \frac{RSS}{N \cdot (1 - EffectiveNumberOfParameters / N)^2}

where RSS is the residual sum of squares measured on the training data and N is the number of training samples.

The EffectiveNumberOfParameters is defined in the MARS context as

EffectiveNumberOfParameters = NumberOfMarsTerms + penalty \cdot \frac{NumberOfMarsTerms - 1}{2}

where penalty is the tunable parameter described further below.
The GCV formula penalizes the addition of terms. It thus adjusts the training RSS to take the flexibility of the model into account. We penalize flexibility because models that are too flexible will model the specific realization of noise in the data instead of just its systematic structure.
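As a quick illustration, here is a small helper that computes the GCV from the two formulas above. The numbers are toy values; the penalty default of 3.0 mirrors pyearth's default.

def gcv(rss, n_samples, n_terms, penalty=3.0):
    """Generalized cross validation score for a MARS model.

    rss       -- residual sum of squares on the training data
    n_samples -- number of training samples N
    n_terms   -- number of MARS terms in the model
    penalty   -- cost per added term (pyearth uses 3.0 by default)
    """
    effective_params = n_terms + penalty * (n_terms - 1) / 2.0
    return rss / (n_samples * (1.0 - effective_params / n_samples) ** 2)

# A larger model must reduce the RSS enough to also win on GCV:
print(gcv(rss=100.0, n_samples=1000, n_terms=5))   # small model
print(gcv(rss=95.0,  n_samples=1000, n_terms=15))  # larger model, only slightly better RSS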
A working example in Python
Now we will look at a working example in Python. For this we use the implementation provided by the pyearth package. We will use the data set from the "House Prices: Advanced Regression Techniques" challenge on Kaggle. The data set contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, and the challenge is to predict the final price of each home.
import numpy as np
import pandas as pd
from scipy.stats import skew
from scipy.stats.stats import pearsonr

train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

# stack the train and test features so that transformations are applied consistently
all_data = pd.concat((train.loc[:, 'MSSubClass':'SaleCondition'],
                      test.loc[:, 'MSSubClass':'SaleCondition']))
We start with some basic data processing. We’re not going to do anything fancy here:
- First I’ll transform the skewed numeric features by taking log(1 + x); this will make the features more normal
- Create Dummy variables for the categorical features
- Replace the numeric missing values (NaN’s) with the mean of their respective columns
train["SalePrice"] = np.log1p(train["SalePrice"])
#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
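The list above also mentions dummy variables and NaN handling, and the later snippets use X_train and y. A sketch of those remaining steps (following the usual approach for this Kaggle data set) could look like this:

all_data = pd.get_dummies(all_data)          # create dummy variables for the categorical features
all_data = all_data.fillna(all_data.mean())  # replace numeric NaNs with the column mean

# split the combined frame back into training features, test features and target
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice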
About model parameters
The pyearth MARS model offers some parameters to tune. I list the most important ones and how they influence the model.
- max_terms: The maximum number of terms generated by the forward pass. Decreasing this number reduces overfitting but can also limit the expressiveness of the model.
- max_degree: The maximum degree of terms generated by the forward pass. Increasing this parameter increases the non-linearity of the model but can lead to overfitting and increases the computation time a lot. Typically only one or two degrees of interaction are allowed, but higher degrees can be used when appropriate.
- penalty: The penalty parameter used to calculate GCV. Used during the pruning pass and to determine whether to add a hinge or linear basis function during the forward pass.
Now we run MARS and look at the performance.
from pyearth import Earth
from sklearn.model_selection import cross_val_score

def rmse_cv(model):
    # 5-fold cross validation, returning the root mean squared error per fold
    rmse = np.sqrt(-cross_val_score(model, X_train, y,
                                    scoring="neg_mean_squared_error", cv=5))
    return rmse

mars = Earth(max_degree=1, penalty=1.0, endspan=5)
cv_mars = rmse_cv(mars).mean()
print(cv_mars)
This gives us a root mean squared logarithmic error of 0.126 (logarithmic because we log-transformed SalePrice). That does not look bad at all and is definitely a good start.
Go further with boosted MARS
We can use MARS models as base learners for boosting, just like decision trees. Scikit-learn offers a convenient way to do this with the AdaBoostRegressor. Just plug the model in and here we go:
from sklearn.ensemble import AdaBoostRegressor

boosted_model = AdaBoostRegressor(base_estimator=mars, n_estimators=25,
                                  learning_rate=0.1, loss="exponential")
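To check whether boosting actually helps on this data set, the boosted model can be evaluated with the same rmse_cv helper from above (a sketch; the score depends on the chosen parameters):

cv_boosted = rmse_cv(boosted_model).mean()
print(cv_boosted)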
All these features make the MARS model a very handy tool in your machine learning toolbox.
Go out and use it!
08/15/2018 at 8:12 am
Hi Tobias,
Have you tried parameter tuning for MARS? Could you suggest a range for each of the parameters you mentioned?
I am thinking of using simple for loops to iterate through each combination and optimise RMSE/MAPE etc.
08/16/2018 at 5:46 am
Hi Joel,
note that you can use the scikit-learn GridSearchCV or RandomizedSearchCV classes to find hyperparameters suitable for your problem.
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
A good start for max_terms is (2*n + m) // 10, where n is the number of features and m the number of samples. But you should not go too high; I would cap the number of terms at around 500.
For max_degree I would try low values like 1, 2, 3, 4. Higher values led to overfitting for me, but if your model still underfits you can try them.
For the penalty value, I select from [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10, 100, 1000].
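A minimal sketch of such a grid search (assuming the X_train and y from the post) could look like this:

from sklearn.model_selection import GridSearchCV
from pyearth import Earth

param_grid = {
    "max_degree": [1, 2, 3, 4],
    "penalty": [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10, 100, 1000],
}
search = GridSearchCV(Earth(), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y)
print(search.best_params_)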
I hope this helps you. But feel free to ask more 🙂
08/20/2018 at 6:04 pm
Hello Tobias – this tutorial is quite helpful, thanks a lot for sharing all of your hard work!
Cheers, mate
08/20/2018 at 8:02 pm
You're very welcome! I'm happy that it helps you.
08/23/2018 at 11:11 am
Do you have any advice for max_terms values that come out to be more than 500?
I've done GridSearchCV and k-fold CV for 8 models and evaluated the mean MAE and RMSE scores for each… and MARS still had the best results. I'm just curious whether I've spent enough time optimizing the tuning parameters to get the best modeling scores.
The computational expense (on a MacBook Air in a Jupyter notebook) seems to be greater for MARS when optimizing some of the tuning parameters than for some other models, even MLP-ANN, KNN, et al.
08/25/2018 at 7:01 pm
I would not recommend using more than 500 terms. Otherwise it will be really costly to compute and hard to regularize. What was the optimal value you found?
You can also try to use a simpler MARS model and use a boosting approach to improve performance.