This project builds a model that predicts whether it will rain the next day. It combines RandomForest and GradientBoosting classifiers in an ensemble, tunes the hyperparameters with grid search, and evaluates the result on held-out data. The dataset is sourced from Kaggle.
- Load the dataset.
- Drop any missing values.
- Encode categorical features using `LabelEncoder`.
- Split the dataset into training (70%) and testing (30%) sets using stratified sampling to ensure class balance.
- Train individual models: RandomForest and GradientBoosting.
- Create an ensemble classifier using soft voting.
- Define parameter grids for both classifiers for grid search.
- Perform grid search with 10-fold cross-validation to find the best hyperparameters.
- Select the best model based on cross-validation results.
- Evaluate the model's accuracy on both training and testing sets.
- Print classification reports and confusion matrices for both sets.
- Assess potential overfitting by comparing cross-validation scores on training and testing sets.
- Calculate a confidence interval for the test set accuracy using the normal approximation.
- Plot learning curves to visualize training and validation scores over different training set sizes.
- Plot validation curves to visualize the effect of different hyperparameters on model performance.
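
The core of the steps above (cleaning, encoding, stratified split, soft-voting ensemble, grid search, and the accuracy confidence interval) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `make_toy_weather` and `fit_ensemble`, the column names, the synthetic data, and the small parameter grid are all assumptions made so the example runs on its own. Running the grid search over the whole voting ensemble, rather than over each classifier separately, is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder


def make_toy_weather(n=300, seed=0):
    """Hypothetical stand-in for the Kaggle data (column names are assumed)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Humidity3pm": rng.uniform(10, 100, n),
        "Pressure3pm": rng.normal(1015, 7, n),
        "WindGustDir": rng.choice(["N", "E", "S", "W"], n),
    })
    wet = df["Humidity3pm"] + rng.normal(0, 15, n) > 70
    df["RainTomorrow"] = np.where(wet, "Yes", "No")
    # Blank out a few values so the drop-missing step has something to do.
    df.loc[rng.choice(n, 10, replace=False), "Pressure3pm"] = np.nan
    return df


def fit_ensemble(df, target="RainTomorrow", cv=10, seed=0):
    # Drop rows with missing values, then label-encode non-numeric columns.
    df = df.dropna().copy()
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            df[col] = LabelEncoder().fit_transform(df[col])

    # Stratified 70/30 split keeps the class ratio the same in both sets.
    X, y = df.drop(columns=target), df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)

    # Soft voting averages the predicted class probabilities of both models.
    ensemble = VotingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=seed)),
                    ("gb", GradientBoostingClassifier(random_state=seed))],
        voting="soft")

    # Illustrative grid only; the "<name>__<param>" keys address each member.
    param_grid = {"rf__n_estimators": [20, 40],
                  "gb__learning_rate": [0.05, 0.1]}
    search = GridSearchCV(ensemble, param_grid, cv=cv, scoring="accuracy")
    search.fit(X_tr, y_tr)
    best = search.best_estimator_

    train_acc = accuracy_score(y_tr, best.predict(X_tr))
    test_acc = accuracy_score(y_te, best.predict(X_te))

    # 95% normal-approximation interval: p +/- 1.96 * sqrt(p * (1 - p) / n).
    n_test = len(y_te)
    half = 1.96 * np.sqrt(test_acc * (1 - test_acc) / n_test)
    return {
        "best_params": search.best_params_,
        "cv_score": search.best_score_,
        "train_acc": train_acc,
        "test_acc": test_acc,
        "ci": (max(0.0, test_acc - half), min(1.0, test_acc + half)),
        "n_test": n_test,
    }


# cv=3 keeps the demo quick; the steps above call for 10 folds.
result = fit_ensemble(make_toy_weather(), cv=3)
print(result)
```

A large gap between `train_acc` and `test_acc` (or between `train_acc` and `cv_score`) is one simple signal of overfitting. The learning and validation curves could be produced from the same fitted setup with scikit-learn's `learning_curve` and `validation_curve` helpers.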