Single Estimator Versus Bagging: Bias-Variance Decomposition in Scikit Learn

The bias-variance trade-off is a fundamental concept in machine learning that every practitioner should understand. A model's expected prediction error can be decomposed into three components:

  1. Bias: This is the error introduced by approximating a real-world problem (which may be complex) with a model that is too simple. High bias can cause an algorithm to miss relevant relations between features and target outputs (underfitting).

  2. Variance: This is the error introduced by an overly complex model that tries to fit the noise in the training data. High variance can cause an algorithm to model the random noise in the training data, leading to poor performance on unseen data (overfitting).

  3. Irreducible error: This is the noise term. It is inherent to the data itself (e.g., measurement noise or randomness in the process that generated it) and cannot be reduced by any choice of model.
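The decomposition can be checked numerically. The sketch below (all choices — the true function, noise level, and test point — are illustrative, not from the article) fits a deliberately simple model on many resampled training sets and verifies that, at a fixed test point, the expected squared error is approximately bias² + variance + irreducible noise:

```python
# Hedged sketch: empirically verify MSE ≈ bias² + variance + noise²
# at a single test point, using many independently drawn training sets.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)   # the "true" function (unknown in practice)
noise_std = 0.3           # irreducible noise level
x0 = 1.5                  # fixed test point

preds = []
for _ in range(2000):
    # a fresh training set each round
    X = rng.uniform(0, 3, 30)
    y = f(X) + rng.normal(0, noise_std, 30)
    # a deliberately simple (high-bias) model: a straight-line fit
    coeffs = np.polyfit(X, y, deg=1)
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias_sq = (preds.mean() - f(x0)) ** 2   # squared bias at x0
variance = preds.var()                  # variance of the fitted model at x0
# empirical expected squared error against fresh noisy targets at x0
mse = np.mean((preds - (f(x0) + rng.normal(0, noise_std, preds.size))) ** 2)

print(f"bias^2 + variance + noise^2 = {bias_sq + variance + noise_std**2:.4f}")
print(f"empirical MSE               = {mse:.4f}")
```

The two printed values agree up to Monte Carlo error, which is the decomposition in action: the straight-line fit contributes a large bias term, a small variance term, and the noise floor is untouched.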

Single Estimator vs. Bagging

  • Single Estimator (e.g., a Decision Tree)
    • Bias: A deep decision tree (with many levels) can model a wide variety of relations and might have a low bias.
    • Variance: A deep decision tree can be very sensitive to small changes in the data, leading to a high variance.
  • Bagging (Bootstrap Aggregating)
    • Bias: Bagging doesn't necessarily reduce the bias. It remains similar to that of an individual base model.
    • Variance: By averaging the predictions of multiple models (like the decision trees in a Random Forest), bagging reduces the variance. This happens because the individual models' errors partially cancel each other out when averaged.
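The cancellation effect above is easy to simulate. In the idealized case of fully independent, zero-mean errors, averaging n models divides the variance by n (bagged trees are correlated in practice, so the real gain is smaller but still substantial). The numbers below are illustrative, not measurements from the article's models:

```python
# Hedged sketch: averaging independent noisy predictions shrinks variance.
import numpy as np

rng = np.random.default_rng(42)
n_models, n_trials = 100, 10_000

# each column holds one model's prediction error (zero-mean noise)
errors = rng.normal(0.0, 1.0, size=(n_trials, n_models))

single_var = errors[:, 0].var()           # variance of one model   (~1.0)
ensemble_var = errors.mean(axis=1).var()  # variance of the average (~1/100)

print(f"single model variance:  {single_var:.3f}")
print(f"100-model average var.: {ensemble_var:.3f}")
```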

Demonstration using Scikit-learn

Let's demonstrate this using a single decision tree and a bagged ensemble of decision trees (built with BaggingClassifier) on a toy dataset.

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a toy dataset
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Decision Tree
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_tree))

# Bagging with Decision Trees
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)
print("Bagged Decision Trees Accuracy:", accuracy_score(y_test, y_pred_bag))

In this demonstration, while an individual decision tree tends to overfit, the bagged ensemble typically generalizes better: it keeps a similar bias but achieves substantially lower variance.

Note: While the above demonstration focuses on accuracy, in a real-world scenario, a more thorough analysis would involve studying learning curves, variance, and bias error quantitatively, possibly using techniques like cross-validation.
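As the note suggests, cross-validation is a simple way to make that analysis more quantitative: the spread of fold scores is a rough proxy for a model's sensitivity to the training data. The sketch below recreates the same dataset and models as above (with the article's parameters) so it is self-contained:

```python
# Hedged sketch: use 10-fold cross-validation to compare both the mean
# accuracy and the fold-to-fold score spread of the two models.
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, random_state=42)

for name, clf in [("tree", tree), ("bagging", bag)]:
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean={scores.mean():.3f}  std={scores.std():.3f}")
```

A lower standard deviation across folds for the bagged model is consistent with its reduced variance; a full bias-variance study would go further (e.g., repeated resampling of the training set, as in the decomposition sketch earlier in this article).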

