This repository contains a script that classifies time series by the predictability of their next data point, using various statistical metrics and features. The goal is to rank time series according to their predictability and classify them into groups.
- Introduction
- Time Series Analysis
- Statistical Metrics and Features
- Structure
- Usage
- Configuration
- Implementation Details
- Output
- Contributing
- License
Time series analysis is a crucial aspect of data science, used to analyze sequential data points collected over time. This repository demonstrates how to classify time series based on their predictability using various statistical methods.
A time series is a series of data points indexed in time order, typically with uniform intervals. Time series analysis involves understanding the underlying patterns such as trends, seasonality, and noise to make predictions.
- Trend: The long-term movement in the data.
- Seasonality: The repeating short-term cycle in the data.
- Noise: The random variation in the data.
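The three components above can be made concrete with a small synthetic series. This is an illustrative sketch (the values and the monthly frequency are assumptions, not the script's actual sample data):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 120  # ten years of monthly observations

trend = 0.5 * np.arange(n)                                 # long-term upward movement
seasonality = 10 * np.sin(2 * np.pi * np.arange(n) / 12)   # repeating yearly cycle
noise = rng.normal(scale=2.0, size=n)                      # random variation

# An observed series is the sum of the three components.
series = trend + seasonality + noise
```

Decomposing an observed series back into these components (e.g. with a moving average or STL) is the usual first step before assessing predictability.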
The script calculates several statistical metrics and features to evaluate and rank the predictability of each time series. Here's an explanation of each:
- SMAPE (Symmetric Mean Absolute Percentage Error):
  - Measures forecast accuracy as a percentage error that treats over- and under-forecasts symmetrically.
  - Lower SMAPE indicates higher predictability.
- MAPE (Mean Absolute Percentage Error):
  - Measures the average absolute percentage deviation of forecasts from actual values.
  - Lower MAPE indicates higher predictability.
- AIC (Akaike Information Criterion):
  - Measures the quality of a model relative to other models.
  - Lower AIC indicates a better fit.
- Degree of Differencing:
  - Indicates the number of times the data needs to be differenced to achieve stationarity.
- Autoregressive (AR) Terms:
  - The number of lag observations included in the model.
- Variance:
  - Measures the spread of the data points.
- Seasonality:
  - Presence of repeating patterns at regular intervals.
- Holiday Effect:
  - The impact of holidays on the data.
- Spikes and Dips:
  - Presence of sudden increases or decreases in the data.
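The two error metrics at the top of the list can be sketched as follows (standard textbook definitions; the function names are illustrative and not necessarily those used in `metrics.py`):

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE in percent; lower means more predictable."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2
    return np.mean(np.abs(forecast - actual) / denom) * 100

def mape(actual, forecast):
    """MAPE in percent; undefined when `actual` contains zeros."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100
```

SMAPE is often preferred when the series can take values near zero, since MAPE's division by `actual` blows up there.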
- `main.py`: The main script to execute the classification process.
- `metrics.py`: Module containing functions to calculate metrics and handle preprocessing.
- `config.py`: Configuration file for defining weights and other parameters.
- `requirements.txt`: List of dependencies.
- `outputs/`: Folder where the classification results are saved as a JSON file.
- Install the required dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Run the main script:

  ```shell
  python main.py
  ```
- Results: The results are saved to `outputs/results.json`, which lists the names of the time series in each group.
```python
# config.py

# Define weights for each metric (the weights below sum to 1.0)
weights = {
    'smape': 0.2,
    'mape': 0.2,
    'aic': 0.1,
    'degree_of_differencing': 0.1,
    'ar_terms': 0.1,
    'variance': 0.1,
    'seasonality': 0.1,
    'holiday_effect': 0.05,
    'spikes_and_dips': 0.05
}

# Define other parameters as needed
train_size_ratio = 0.8
```
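The `train_size_ratio` parameter controls a chronological train/test split (no shuffling, since order matters in time series). A minimal sketch of how such a split works, using a placeholder series:

```python
train_size_ratio = 0.8  # value defined in config.py

series = list(range(100))  # placeholder time series data

# Split chronologically: the first 80% trains the model,
# the remaining 20% evaluates its forecasts.
split = int(len(series) * train_size_ratio)
train, test = series[:split], series[split:]
```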
The script uses synthetic data for demonstration purposes. Replace the sample data with your actual time series data.
The calculate_metrics function in metrics.py computes the various metrics for each time series. The metrics are normalized and weighted according to the configuration.
Time series are ranked based on the composite score, calculated by summing the weighted metrics. The ranked series are then classified into groups.
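The ranking step described above can be sketched as a weighted sum over normalized metrics, with the ranked list then chunked into groups. The metric values, weights, and three-way split below are illustrative assumptions, not the repository's actual configuration:

```python
# Hypothetical normalized metric values per series (0 = best, 1 = worst).
metrics = {
    "TS_1": {"smape": 0.1, "mape": 0.2, "variance": 0.3},
    "TS_2": {"smape": 0.6, "mape": 0.5, "variance": 0.4},
    "TS_3": {"smape": 0.9, "mape": 0.8, "variance": 0.7},
}
weights = {"smape": 0.5, "mape": 0.3, "variance": 0.2}

# Composite score: weighted sum of the normalized metrics.
scores = {name: sum(weights[m] * v for m, v in vals.items())
          for name, vals in metrics.items()}

# Rank ascending (lower score = more predictable), then split into 3 groups.
ranked = sorted(scores, key=scores.get)
k = 3
size = -(-len(ranked) // k)  # ceiling division
groups = {f"Group_{i + 1}": ranked[i * size:(i + 1) * size] for i in range(k)}
```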
The results are saved as a single JSON file (outputs/results.json) with keys Group_1, Group_2, and Group_3, each containing the corresponding time series.
The output is a JSON file (outputs/results.json) structured as follows:
```json
{
  "Group_1": ["TS_1", "TS_2", "TS_3", ...],
  "Group_2": ["TS_4", "TS_5", "TS_6", ...],
  "Group_3": ["TS_7", "TS_8", "TS_9", ...]
}
```
Contributions are welcome! Please fork the repository and submit a pull request.