Introduction
Previously, we wrote about some common tradeoffs in machine learning and the importance of tuning models to your specific dataset. We demonstrated how to tune a random forest classifier using grid search, and how crossvalidation can help avoid overfitting when tuning hyperparameters (HPs).
In this followup post, you’ll beef up your machine learning toolbox by trying out some new, broadlyapplicable tools. You’ll learn a different strategy for traversing hyperparameter space – randomized search – and how to use it to tune two other classification algorithms – a support vector machine and a regularized logistic regression classifier.
We’ll keep working with the wine dataset, which contains chemical characteristics of wines of varying quality. As before, our goal is to try to predict a wine’s quality from these features.
Here are the things we’ll cover in this blog post:
In the next blog post, you will learn how to take these three different tuned machine learning algorithms and combine them to build an aggregate model ensemble. Building ensembles often leads to improved model performance and generalizability. Stay tuned!
Loading and train/test splitting the dataset
You start off by collecting the dataset. We have covered the data loading, preprocessing, and train/test splitting previously, so we won’t repeat ourselves here. Also check out this post on using plotly
to create exploratory, interactive graphics of the wine dataset features.
You can fetch and format the data as follows:
import wget import pandas as pd import numpy as np from sklearn.cross_validation import train_test_split # Import the dataset data_url = 'https://raw.githubusercontent.com/nslatysheva/data_science_blogging/master/datasets/wine/winequalityred.csv' dataset = wget.download(data_url) dataset = pd.read_csv(dataset, sep=";") # Using a lambda function to bin quality scores dataset['quality_is_high'] = dataset.quality.apply(lambda x: 1 if x >= 6 else 0) # Convert the dataframe to a numpy array and split the # data into an input matrix X and class label vector y npArray = np.array(dataset) X = npArray[:,:2].astype(float) y = npArray[:,1] # Split into training and test sets XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)
Introducing randomized search: comparison with grid search
You have already built a random forest classifier and tuned it using grid search to predict wine quality (here). Grid search is quite commonly used, and is essentially just a method that exhaustively tries out all combinations of manually prespecified HP values and reports the best option (i.e. the one leading to the highest test accuracy). The benefit of this approach is that you thoroughly test out various combinations, but this is of course very computationally expensive. For grid search to be tractable, you often have to restrict the number of combinations, which can severely limit how well you explore hyperparameter space and lead to you overlooking regions where accuracy would be highest.
Another way to search through hyperparameter space to find optima is via randomized search. In randomized search, you sample HP values a certain number of times from some distribution which you prespecify in advance. So unlike grid search, in which you specify particular numbers to combinatorially try out, you instead specify distributions that cover the HP space you want to explore. For example, you might specify a standard normal distribution over an HP if you think reasonable values are roughly centered around 0, or a uniform distribution over some range if you think values within that range are about as likely to be “good”. In randomized search, you also specify a n_iter
parameter, which acts as a computational budget, controlling how many different parameter settings are tried out in total.
We can visually summarize the grid search (grey boxes) and randomized search (purple boxes) strategies like so:
Here, both approaches are constrained by the same computational budget – they can only try out 9 different HP settings (i.e. certain values for HP1 and HP2). Randomized search tries out HP values from two normal distributions (purple bell curves), repeating the process 9 times and thus getting 9 different values of both HP1 and HP2. Most combinations fall into the meaty portion of the normal distributions, but occasionally the tails are sampled from as well – this means you have at least some chance of trying out distant regions that could potentially strike gold (i.e. the hypothetically optimal HP space leading to high accuracy, bottom right).
Meanwhile, grid search tries out 3 values each of HP1 and HP2. Of course, these values do not have to be as close to each other as we have drawn (and one could indeed hit the gold space with grid search), but the idea is that since you are constrained to trying out all combinations of prespecified HP values, this intrinsically limits how much of the HP space can be explored. Specifically, here randomized search has searched a space that is 16 times bigger (we drew a 3×3 box for the grid search and a 12×12 box for the larger grid). The n_iter
argument controlling the number of HP combinations to try out gives you access to a tradeoff between computational resources invested and the HP space you can explore.
Check out this paper outlining the efficiency of randomized search compared to grid search, especially in highdimensional HP spaces. You can imagine that if you already have 12×12=144 possible combinations of 2 HPs, adding another HP increases the number of possibilities to search through to 12x12x12=1728. This becomes very demanding very quickly and randomized search is the only feasible practical approach. However intuitively, were computational resources and patience infinite, grid search would become the better choice.
Using randomized search
Scikit makes using randomized search easy with RandomizedSearchCV
. You can feed distributions of HPs to the RandomizedSearchCV
object in two (fairly similar) ways:
1. You can either define distributions over HPs, without immediately sampling from them, and pass these distributions to RandomizedSearchCV
, which will proceed to sample n_iter
number of times with replacement from the distributions to generate candidate HP combinations.
2. You can sample from distributions immediately and pass a list of possible HP values to RandomizedSearchCV
, and it will sample from these possible values n_iter
number of times without replacement.
Both approaches lead to the same outcome and you will be using the second one here as it allows you to have a peak at the HP values that were sampled beforehand. Here is what this looks like with random forests:
from scipy.stats import uniform from scipy.stats import norm from sklearn.grid_search import RandomizedSearchCV from sklearn import metrics # Designate distributions to sample hyperparameters from n_estimators = np.random.uniform(70, 80, 5).astype(int) max_features = np.random.normal(6, 3, 5).astype(int) # Check max_features>0 & max_features<=total number of features max_features[max_features <= 0] = 1 max_features[max_features > X.shape[1]] = X.shape[1] hyperparameters = {'n_estimators': list(n_estimators), 'max_features': list(max_features)} print (hyperparameters)
{'n_estimators': [79, 79, 71, 71, 76], 'max_features': [3, 11, 8, 1, 5]}
You then run the random search:
from sklearn.ensemble import RandomForestClassifier # Run randomized search randomCV = RandomizedSearchCV(RandomForestClassifier(), param_distributions=hyperparameters, n_iter=20) randomCV.fit(XTrain, yTrain) # Identify optimal hyperparameter values best_n_estim = randomCV.best_params_['n_estimators'] best_max_features = randomCV.best_params_['max_features'] print("The best performing n_estimators value is: {:5d}".format(best_n_estim)) print("The best performing max_features value is: {:5d}".format(best_max_features)) # Train classifier using optimal hyperparameter values # We could have also gotten this model out from randomCV.best_estimator_ rf = RandomForestClassifier(n_estimators=best_n_estim, max_features=best_max_features) rf.fit(XTrain, yTrain) rf_predictions = rf.predict(XTest) print (metrics.classification_report(yTest, rf_predictions)) print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, rf_predictions), 3))
The best performing n_estimators value is: 71
The best performing max_features value is: 3
precision recall f1score support
0.0 0.79 0.84 0.82 188
1.0 0.85 0.81 0.83 212
avg / total 0.82 0.82 0.82 400
('Overall Accuracy:', 0.823)
Let’s compare this performance to the default random forest:
[bound method RandomForestClassifier.get_params of RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)]
# Create default rf rf = RandomForestClassifier() print(rf.get_params) # Fit and predict with default rf rf.fit(XTrain, yTrain) rf_predictions = rf.predict(XTest) print (metrics.classification_report(yTest, rf_predictions)) print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, rf_predictions),3))
precision recall f1score support
0.0 0.73 0.85 0.79 188
1.0 0.85 0.73 0.78 212
avg / total 0.79 0.79 0.78 400
('Overall Accuracy:', 0.785)
Looks like the default performance is slightly lower, which is generally what you might expect. Either grid search or randomized search are good options for tuning random forests.
Let’s look at how to tune the two other predictors.
Tuning a support vector machine
Let’s train the second algorithm, a support vector machine (SVM
) classifier, to do the same wine quality prediction task. A great introduction to the theory behind SVMs can be found in Chapter 9 of the Introduction to Statistical Learning book or in this nice blog post. Briefly, SVMs search for separating hyperplanes in the feature space which best divide the different classes in your dataset. If you had 2 features, SVMs would search for the best dividing line; if you had 3 features, SVMs search for the best dividing 2d plane, etc. Crucially, SVMs can construct complex, nonlinear decision boundaries between classes by making use of a process called kernelling, which projects the data into a higherdimensional space and facilitates the identification of a good boundary.
SVMs can use different types of kernel functions, like linear, polynomial, Gaussian or radial kernels, to throw the data into a different space. Let’s use the popular radial basis function kernel (RBF kernel). In the case of RBF SVMs, the hyperparameters to tune include:
gamma
– it controls how influential a single observation can be when being selected as a support vector in the model. Low values forgamma
lead to large influence of individual observations and high values to less influence.C
– it controls the ‘softness’ of the classification boundary margin and hence the biasvariance tradeoff of the model. Lower values forC
will draw smoother decision boundaries (less flexible), whereas higher values will give more rugged boundaries that can fit the training data better (more flexible)
Examine the default HP settings and performance:
from sklearn import svm default_SVC = svm.SVC() print ("Default SVC parameters are: \n{}".format(default_SVC.get_params))
Default SVC parameters are:
[bound method SVC.get_params of SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)]
Using the default settings, the SVM performance:
from sklearn.svm import SVC # Create, fit, and test default SVM rbfSVM = SVC(kernel='rbf') rbfSVM.fit(XTrain, yTrain) svm_predictions = rbfSVM.predict(XTest) print (metrics.classification_report(yTest, svm_predictions)) print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, svm_predictions),3))
precision recall f1score support
0.0 0.67 0.75 0.71 188
1.0 0.75 0.67 0.71 212
avg / total 0.71 0.71 0.71 400
('Overall Accuracy:', 0.71)
Now use randomized search to try to improve on this accuracy. First, define distributions you want to sample HP values from and create a dictionary of possible values:
# Designate distributions to sample hyperparameters from np.random.seed(123) g_range = np.random.uniform(0.0, 0.3, 5).astype(float) C_range = np.random.normal(1, 0.1, 5).astype(float) # Check that gamma>0 and C>0 C_range[C_range < 0] = 0.0001 hyperparameters = {'gamma': list(g_range), 'C': list(C_range)} print (hyperparameters)
{'C': [1.0322106068339623, 0.99484822790606153, 0.97957990353611057, 1.197934843277785, 0.83806999349632538], 'gamma': [0.20894075567935849, 0.085841800485113834, 0.068055436069260927, 0.16539443072486737, 0.21584069093566891]}
Now pass this dictionary to param_distributions
argument of RandomizedSearchCV
:
# Run randomized search randomCV = RandomizedSearchCV(SVC(kernel='rbf', ), param_distributions=hyperparameters, n_iter=20) randomCV.fit(XTrain, yTrain) # Identify optimal hyperparameter values best_gamma = randomCV.best_params_['gamma'] best_C = randomCV.best_params_['C'] print("The best performing gamma value is: {:5.2f}".format(best_gamma)) print("The best performing C value is: {:5.2f}".format(best_C))
The best performing gamma value is: 0.07
The best performing C value is: 0.84
We can examine the scores of e.g. the first 5 tested HP combinations:
print (randomCV.grid_scores_[0:5])
[mean: 0.67473, std: 0.01398, params: {'C': 1.197934843277785, 'gamma': 0.20894075567935849}, mean: 0.67556, std: 0.01087, params: {'C': 1.197934843277785, 'gamma': 0.21584069093566891}, mean: 0.67056, std: 0.00857, params: {'C': 0.97957990353611057, 'gamma': 0.21584069093566891}, mean: 0.67223, std: 0.01004, params: {'C': 1.0322106068339623, 'gamma': 0.21584069093566891}, mean: 0.67306, std: 0.01285, params: {'C': 1.0322106068339623, 'gamma': 0.20894075567935849}]
# Train SVM and output predictions rbfSVM = SVC(kernel='rbf', C=best_C, gamma=best_gamma) rbfSVM.fit(XTrain, yTrain) svm_predictions = rbfSVM.predict(XTest) print (metrics.classification_report(yTest, svm_predictions)) print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, svm_predictions),4))
precision recall f1score support
0.0 0.65 0.75 0.70 188
1.0 0.74 0.64 0.69 212
avg / total 0.70 0.69 0.69 400
('Overall Accuracy:', 0.6925)
Looks like we get similar accuracy as the default model in this case. This is fine, but in general you might think about doing things like casting the net wider to try out quite different HP values, adding new HPs to the tuning process, try a different learning algorithm, etc.
Tuning a logistic regression classifier
The final model you’ll tune and apply to predict wine quality is a logistic regression classifier (LogisticRegression
). This is a type of regression model which is used for predicting binary outcomes (like good wine/not good wine). Logistic regression fits a sigmoidal (Sshaped) curve through the data, but can be viewed as just a transformed version of linear regression – a straight line predicting the log odds of data points being in one of the two classes. A nice explanation of logistic regression can be found here.
One topic you will often encounter in machine learning is regularization, which is a class of techniques to reduce overfitting. The idea behind regularization is that you do not only want to maximize a model’s fit to your data, since this is susceptible to overfitting. Regularization techniques try to cut down on overfitting by penalizing models, for example if they use too many parameters, or if they assign coefficients or weights that are “too big”. Regularization means that models have to learn from the data under a series of constraints, which often leads to robust representations of the data.
You can adjust just how much regularization you want by adjusting regularization hyperparameters, and since this is something you might want to do often, scikitlearn comes with some prebuilt models that can very efficiently fit data for a range of regularization hyperparameter values. This is the case for regularized linear regression models like Lasso regression and ridge regression, which use l1 and l2 penalties, respectively, to shrink the size of the regression coefficients. These scikit modules offer a shortcut to performing crossvalidated selection of the regularization hyperparameter.
But you can also optimize how much regularization you want yourself, while at the same time tuning other hyperparameters (like the choice between l1 and l2 penalty), in the same manner as you’ve been doing.
Let’s examine default HP settings and performance for a logistic regression model:
# Tuning a regularized logistic regression model from sklearn.linear_model import LogisticRegression # Examine defaults default_lr = LogisticRegression() print ("Default logistic regression parameters are: {}".format(default_lr.get_params)) # Train model and output predictions classifier_logistic = LogisticRegression() classifier_logistic_fit = classifier_logistic.fit(XTrain, yTrain) logistic_predictions = classifier_logistic_fit.predict(XTest) print metrics.classification_report(yTest, logistic_predictions) print "Overall Accuracy:", round(metrics.accuracy_score(yTest, logistic_predictions),3)
Default logistic regression parameters are: [bound method LogisticRegression.get_params of LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0)]
precision recall f1score support
0.0 0.73 0.74 0.73 188
1.0 0.77 0.75 0.76 212
avg / total 0.75 0.75 0.75 400
Overall Accuracy: 0.748
Now to optimise the HPs:
# Specify HP distributions penalty = ["l1", "l2"] np.random.seed(123) C_range = np.random.normal(1, 0.2, 10).astype(float) # Check that C>0 C_range[C_range < 0] = 0.0001 hyperparameters = {'penalty': penalty, 'C': C_range} print (hyperparameters)
{'penalty': ['l1', 'l2'], 'C': array([ 0.78287388, 1.19946909, 1.0565957 , 0.69874106, 0.88427995,
1.33028731, 0.51466415, 0.91421747, 1.25318725, 0.82665192])}
And feeding these values into RandomizedSearchCV
:
# Randomized search using crossvalidation randomCV = RandomizedSearchCV(LogisticRegression(), param_distributions=hyperparameters, cv=20) randomCV.fit(XTrain, yTrain) best_penalty = randomCV.best_params_['penalty'] best_C = randomCV.best_params_['C'] print ("The best performing penalty is: {}".format(best_penalty)) print ("The best performing C value is: {:5.2f}".format(best_C))
The best performing penalty is: l2
The best performing C value is: 0.70
We can now use these values to train a new, hopefully better model:
# Train model and output predictions classifier_logistic = LogisticRegression(penalty=best_penalty, C=best_C) classifier_logistic_fit = classifier_logistic.fit(XTrain, yTrain) logistic_predictions = classifier_logistic_fit.predict(XTest) print metrics.classification_report(yTest, logistic_predictions) print "Overall Accuracy:", round(metrics.accuracy_score(yTest, logistic_predictions),3)
precision recall f1score support
0.0 0.72 0.73 0.73 188
1.0 0.76 0.75 0.76 212
avg / total 0.74 0.74 0.74 400
Overall Accuracy: 0.743
This also actually looks quite similar to the default performance. It’s always important to tune hyperparameters, but the effect might vary across learning algorithms and datasets.
Conclusion
In this tutorial, you have done randomized search to optimize a random forest, an SVM model, and logistic regression. Fancier techniques for hyperparameter optimization include methods based on gradient descent, grad student descent (not recommended), and Bayesian approaches which update prior beliefs about good values of hyperparameters based on observed performance (see Spearmint and hyperopt). The important thing is to try out different learning algorithms and HP settings on your dataset.
In the next post, we will try to leverage the contribution of different types of models by building them up into an ensemble model, with the aim of increasing predictive performance.
Quiz
 Explain why the main advantages of randomized search over grid search are the following (as stated by scikit):
 A budget can be chosen independent of the number of parameters and possible values.
 Adding parameters that do not influence the performance does not decrease efficiency.

RandomizedSearchCV
will throw an error ifn_iter
is greater than the number of possible HP combinations (e.g.ValueError: The total space of parameters 25 is smaller than n_iter=100.
). Why is this a useful property? What happens if the number of combinations is indeed smaller thann_iter
, but some of the combinations happen to be duplicates? 
If the HP search space is quite small, the behaviour of randomized search can approach that of grid search. Specifically, if you were to sample values of two HPs from uniform distributions n times and set
n_iter
ton*n
, then randomized search would be very similar to a grid search over the same HP ranges. But it wouldn’t be quite identical – why?
ABOUT THE AUTHORS
Natasha Latysheva
Natasha is a computational biology PhD student at the MRC Laboratory of Molecular Biology. Her research is focused on cancer genomics, statistical network analysis, and protein structure. More generally, her research interests lie in dataintensive molecular biology, machine learning (especially deep learning) and data science.
Charles Ravarani
Charles is a research associate at the MRC Laboratory of Molecular Biology. Since the time of his PhD in Computational Biology, Charles has been working with largescale genomic datasets to build molecular models of gene expression noise that ultimately improve the efficiency of current drug treatments. In his research, machine learning techniques have proven very useful and he is keen to be involved in the wider ML community to learn about new techniques.