Beyond Accuracy: Building Trustworthy Extreme Events Predictions Through Explainable Machine Learning

Abstract: Extreme events, despite their rarity, pose a significant threat because of their immense impact. While machine learning has emerged as a game-changer for predicting these events, the crucial challenge lies in trusting those predictions. Existing studies focus primarily on improving accuracy and neglect model explainability, a gap that hinders the integration of these solutions into decision-making processes. Addressing this critical issue, this paper investigates the explainability of extreme event forecasting using a hybrid forecasting and classification approach. Focusing on two economic indicators, the Business Confidence Index (BCI) and the Consumer Confidence Index (CCI), the study aims to understand why and when extreme event predictions can be trusted, especially in the context of imbalanced classes (normal vs. extreme events). Machine learning models are comparatively analysed and their explainability is explored through dedicated tools. Additionally, various class balancing methods are assessed for their effectiveness. This combined approach examines the factors influencing extreme event prediction accuracy, offering valuable insights for building trustworthy forecasting models.


Introduction
Extreme events have attracted considerable attention from researchers over the last few decades because of their two-edged nature: small in number but large in impact, which poses a particularly difficult quandary (Chen, Gupta, and Tragoudas 2022; Ghil et al. 2011). Studies on the topic cover their summarization, detection, and prediction in different areas such as finance, weather, etc. (Zhao 2020). Regarding their prediction, with the large amount of data generated day to day, machine learning (deep learning) is seen as a game changer owing to its ability to capture hidden patterns in data and generate accurate predictions, which is the major limitation of the statistical approach. Thus, powerful frameworks have been developed to provide better results in this scope of studies. However, techniques applied in the forecasting of extreme events are generally in their infancy and mostly domain-specific, and despite the large advances achieved in their forecasting through machine learning, the crucial difficulty lies not so much in finding new prediction methods as in finding ways to trust these predictions. The available literature mostly considers improvements in forecasting accuracy, such as (Ding et al. 2019), which proposes improving the forecasting of extreme events by using an extreme value loss instead of the quadratic loss. The analysis and prediction of the coronavirus outbreak is proposed in (Petropoulos and Makridakis 2020), where machine learning was able to unveil hidden valuable information to detect future outbreaks of the pandemic. To address the imbalanced classes inherent in extreme event prediction, a technique based on block resampling in joint predictor-forecast space is proposed in (Chen et al. 2022). While these studies propose various outstanding approaches to improve accuracy in the prediction of extreme events, they give little insight into the explainability of the models applied to extreme event forecasting. It is becoming harder for humans to interpret these models as their complexity increases. This situation can undermine confidence in integrating such solutions into the decision-making process (Wu et al. 2021; Zhao 2020), despite their outstanding performance. This motivates the present paper, which addresses the explainability of extreme event forecasting using a hybrid forecasting and classification approach that aims to determine whether events in the near future are normal or extreme. The approach deepens one's understanding of why and when the prediction of extreme events, in a combined forecasting and classification setting, should be trusted. A case study is considered, focused on two crucial lagging indicators in economics and business: the Business Confidence Index (BCI) (Anon n.d.-a) and the Consumer Confidence Index (CCI) (Anon n.d.-a). These indices, often used to gauge consumer and business sentiment within a specific period, can fluctuate in tandem and influence significant economic events like downturns, job losses, and financial stress (Anon 2014; Bielova, Halík, and Ryabushka 2021; Juhro and Iyke 2020; Teresiene et al. 2021). Our goal is to use these indices to predict their joint trend for the next month and determine whether it indicates an economic downturn (extreme event), based on a pre-defined threshold: values below the threshold represent an extreme event, and values above it are normal.
The remainder of the paper is organized as follows: after this introduction (Section I), Section II deals with the methodology and explains the different concepts considered in this research, Section III is dedicated to the results and discussion, and Section IV concludes.

Hybrid Forecasting Classification at a Glance
This forecasting method, applied to extreme events, combines two forecasting approaches, regression and classification, using one or multiple models in either a univariate or multivariate setting, to answer questions about forecasting future events and detecting extreme events within that predicted future (X. Chen et al. 2022; Ghil et al. 2011; Zhao 2020).
Divided into two stages (Figure 1), this technique, in the scope of this study, consists first, after data acquisition and preprocessing, of defining the threshold that delimits events under a defined labelling, here zero for a normal event and one for an extreme event. A formal supervised learning technique for time series forecasting is applied to the data to predict the next values. The second stage consists of training a classifier on the predictions obtained in the first stage, classifying the prediction probability of each value against the pre-established labels for each event, and evaluating the final result based on the goodness of matching and the quality of the matched predictions. The two most popular distributions studied in Extreme Value Theory (EVT) (Abdulali et al. 2022) were considered: the Generalized Extreme Value (GEV) distribution, which governs the distribution of block maxima, and the Generalized Pareto (GP) distribution, which concerns the distribution of excesses over a certain threshold (Abdulali et al. 2022; Castro-Camilo, Huser, and Rue 2022; De Zea Bermudez and Kotz 2010; Galib et al. 2022). The thresholding method considered is based on estimating the probability of events, which offers a robust and flexible solution that does not depend on the distribution of the data. The minimum value resulting from fitting the GEV and GP distributions to the dataset at the 0.5 percent level is the threshold below which any value is extreme.
Fitting the GEV distribution involves three parameters, estimated using the Maximum Likelihood Estimation (MLE) method: location (μ), scale (σ), and shape (ξ). The cumulative distribution function (CDF) of the GEV distribution, for $1+\xi(x-\mu)/\sigma>0$ and $\xi\neq 0$, is given by:

$$F(x;\mu,\sigma,\xi)=\exp\left\{-\left[1+\xi\left(\frac{x-\mu}{\sigma}\right)\right]^{-1/\xi}\right\} \qquad (1)$$

Fitting the GP distribution is used to model the extreme values above a certain threshold. Assuming that excesses beyond the threshold follow a GP distribution, it has two parameters, also estimated with the MLE method: shape (ξ) and scale (σ). The CDF of the GP distribution, for an excess $y\geq 0$ and $\xi\neq 0$, is given by:

$$G(y;\sigma,\xi)=1-\left(1+\frac{\xi y}{\sigma}\right)^{-1/\xi} \qquad (2)$$

To determine the threshold below which any value is considered extreme, the value corresponding to the 0.5 percent quantile of the fitted distribution is selected. This quantile is obtained through the inverse of the CDF:

$$T=F^{-1}(0.005) \qquad (3)$$

Once the threshold T is determined, any value below it is considered extreme.
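As a minimal sketch of this thresholding step, assuming the two series are already loaded as NumPy arrays (variable names are illustrative), the fitting and quantile inversion can be done with the SciPy distributions the study reports using, genextreme and genpareto:

```python
# Minimal sketch of the thresholding step; `bci` and `cci` are assumed
# to be 1-D NumPy arrays holding the two indices.
import numpy as np
from scipy.stats import genextreme, genpareto

def extreme_threshold(values: np.ndarray, q: float = 0.005) -> float:
    """Fit GEV and GP by maximum likelihood and return the smaller
    0.5% quantile, below which any value is treated as extreme."""
    gev_params = genextreme.fit(values)   # (shape, loc, scale)
    gp_params = genpareto.fit(values)     # (shape, loc, scale)
    # Invert each fitted CDF at the 0.5% level, as in equation (3).
    return min(genextreme.ppf(q, *gev_params), genpareto.ppf(q, *gp_params))

# Usage on each index (illustrative):
# t_bci, t_cci = extreme_threshold(bci), extreme_threshold(cci)
```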

Machine Learning Algorithms
Two sets of algorithms were considered: the first for time series forecasting and the second for classification. For the first, the combination of a convolutional neural network with an LSTM (the CNN_LSTM) and the ConvLSTM architecture were considered; for the second, classifiers including XGBoost and Naive Bayes, whose configurations are detailed in the Results and Discussion section.

Interpretability Techniques
Three popular interpretation methods were used to interpret the predictions:
a. Feature importance: in this hybrid context, this interpretability technique aims to reveal the importance of specific features or neurons within the models, making it easier to understand and trust the model's decision-making process (Chakraborty et al. 2017; Hooker et al. 2019; Kim and Cho 2019; Wei et al. 2020; Wojtas and Chen 2020).
b. Local Interpretable Model-agnostic Explanations (LIME), used for the classification aspect: it explains the predictions of complex machine learning models in a locally interpretable manner (Ankit …); a small usage sketch follows this list.
c. "Explain Like I'm 5" (Eli5), a Python library integrated with multiple frameworks that provides a comprehensive set of tools for explaining machine learning models and making them more accessible and understandable (Khanna et al. 2023; Vij and Nanjundan 2022).

Class Balancing Methods
Two common class balancing methods are considered: class weighting, a technique that assigns different weights to the classes according to their imbalance (Spelmen and Porkodi 2018), and the Synthetic Minority Over-sampling Technique (SMOTE), an over-sampling technique specifically designed to address class imbalance by generating synthetic samples for the minority class through interpolation between the feature vectors of existing minority class instances (Wu et al. 2022).
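A minimal sketch of both options, using a toy label vector that mirrors the 744-versus-16 split reported later (the features are illustrative):

```python
# Sketch of the two balancing options on toy data.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(1)
X = rng.normal(size=(760, 2))
y = np.r_[np.zeros(744, dtype=int), np.ones(16, dtype=int)]  # 744 normal, 16 extreme

# Option 1: class weights, passed to a classifier's class_weight argument.
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
class_weight = dict(zip(np.unique(y), weights))

# Option 2: SMOTE, which synthesizes minority samples by interpolating
# between existing minority instances.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(class_weight, np.bincount(y_res))  # roughly balanced after SMOTE
```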

Research Design
From the acquisition of data to the explanation of the performance of machine learning algorithms, each step considered in this study is illustrated in Figure 3.

Metrics
Two sets of metrics were considered. For the regression:

• The mean squared error (MSE) and the mean absolute error (MAE), represented respectively in equations (4) and (5):

$$MSE=\frac{1}{n}\sum_{i=1}^{n}\left(Y_{p,i}-Y_{a,i}\right)^{2} \qquad (4)$$

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|Y_{p,i}-Y_{a,i}\right| \qquad (5)$$

where $n$ is the number of samples, $Y_{p,i}$ is the predicted output, $Y_{a,i}$ is the actual output, and $|\cdot|$ is the absolute value (Singla, Duhan, and Saroha 2022).

For the classification approach, actual positives that are correctly predicted positive are called true positives (TP), actual positives that are wrongly predicted negative are called false negatives (FN), actual negatives that are correctly predicted negative are called true negatives (TN), and actual negatives that are wrongly predicted positive are called false positives (FP) (Chicco and Jurman 2020). The following metrics were considered:

• The confusion matrix;
• The harmonic mean of precision and recall (F1 score), where $Precision=TP/(TP+FP)$ and $Recall=TP/(TP+FN)$, represented by equation (8):

$$F1=2\cdot\frac{Precision\times Recall}{Precision+Recall} \qquad (8)$$
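These metrics map directly onto scikit-learn helpers; a small sketch with toy values (illustrative only):

```python
# Sketch of the metrics with toy data.
from sklearn.metrics import (confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error)

# Regression metrics, equations (4) and (5), on toy forecasts:
y_actual = [101.2, 100.5, 99.1]
y_pred = [101.0, 100.9, 98.8]
print(mean_squared_error(y_actual, y_pred), mean_absolute_error(y_actual, y_pred))

# Classification metrics on toy event labels (1 = extreme, 0 = normal):
e_actual = [0, 0, 1, 1, 0, 1]
e_pred = [0, 1, 1, 0, 0, 1]
print(confusion_matrix(e_actual, e_pred))  # [[TN, FP], [FN, TP]]
print(f1_score(e_actual, e_pred))          # equation (8)
```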

Results and Discussion
Python 3.10 was used on Google Colab. Keras was used to build the deep learning models and scikit-learn for the machine learning models. The GEV and GP distributions were used to define extreme events in the dataset, with genextreme and genpareto called from the SciPy library.

Defining Threshold
The results show that the GEV distribution is suitable for defining the threshold (Table 2).

Figure 6. GEV Fitted Distribution
The GEV fits the two variables of the dataset better than the GP (Figure 6). Thus, with the thresholds defined, events are labelled through Pseudo code 1 (1 for an extreme event and 0 for a normal event), sketched below.
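A minimal sketch of the labelling logic in Pseudo code 1, assuming (consistently with the counts reported below) that a month counts as extreme only when both indices fall below their thresholds:

```python
# Sketch of the labelling in Pseudo code 1: a month is extreme (1) when
# both BCI and CCI fall below their thresholds, otherwise normal (0).
# The joint rule is an assumption consistent with the reported counts.
import numpy as np

def label_events(bci: np.ndarray, cci: np.ndarray,
                 t_bci: float, t_cci: float) -> np.ndarray:
    return ((bci < t_bci) & (cci < t_cci)).astype(int)
```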
While a total of 88 values fell below the threshold (41 for BCI and 47 for CCI, as shown in Figure 6a), only 16 of these events qualified as extreme under the study's criteria (Figure 6b and Table 2). This resulted in a significant class imbalance, with 744 normal events against only 16 extreme events (Figure 7a). After partitioning the data into training and testing sets, the class distributions remained imbalanced, as depicted in Figures 7b and 7c: the training set holds 500 normal and 9 extreme events (509 points, 67% of the dataset), and the test set holds 244 normal and 7 extreme events (251 points, 33% of the dataset). These data are used for prediction with the selected algorithms, and the predictions are compared against these values to evaluate performance.

Forecasting Algorithms
After preparing the data in a format suitable for time series forecasting with short-term prediction (one step ahead), and reshaping it to match the input shapes of the CNN_LSTM and ConvLSTM algorithms, the structures of these algorithms are shown in Figures 8a and 8b; an illustrative sketch of both is given below. On the test set, the CNN_LSTM tends to deviate from the actual data (Figure 9a), while the ConvLSTM shows better behaviour (Figure 9b). When detecting events in the predictions provided by these models, the prediction error caused the number of detected events to be either increased or decreased, as Figures 11a and 11b depict.
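A hedged Keras sketch of both architectures follows; the window split, filter counts, and unit sizes are illustrative assumptions, not the study's exact configuration:

```python
# Illustrative Keras sketches of the two forecasting architectures.
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import (LSTM, Conv1D, ConvLSTM2D, Dense,
                                     Flatten, MaxPooling1D, TimeDistributed)

n_seq, n_steps, n_features = 2, 2, 2  # 4-step window split into 2 subsequences

# CNN_LSTM: a CNN summarizes each subsequence, then an LSTM reads the
# sequence of CNN summaries.
cnn_lstm = Sequential([
    Input(shape=(n_seq, n_steps, n_features)),
    TimeDistributed(Conv1D(64, kernel_size=1, activation="relu")),
    TimeDistributed(MaxPooling1D(pool_size=1)),
    TimeDistributed(Flatten()),
    LSTM(50, activation="relu"),
    Dense(n_features),  # one-step-ahead forecast of BCI and CCI
])

# ConvLSTM: the convolution happens inside the recurrent transition.
conv_lstm = Sequential([
    Input(shape=(n_seq, 1, n_steps, n_features)),
    ConvLSTM2D(64, kernel_size=(1, 1), activation="relu"),
    Flatten(),
    Dense(n_features),
])

cnn_lstm.compile(optimizer="adam", loss="mse")
conv_lstm.compile(optimizer="adam", loss="mse")
```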

Both models were able to identify most of the extreme events in the training data, with ConvLSTM matching the original number perfectly (9) and CNN_LSTM coming close (12).However, their performance on the test set differed.CNN_LSTM generated extra false positives by predicting more extreme events than actually occurred (8 vs. 7).ConvLSTM, on the other hand, missed one genuine extreme event (6 vs. 7).While ConvLSTM missed one crucial event, its overall performance was more precise by avoiding false positives.The next step will involve using these predictions as input for classification algorithms to further analyze and refine the identification of extreme events.

Classification Algorithms
The hyperparameters of these algorithms, especially those addressing class balancing, are provided in Table 5; for the rest, the individual default values were used in the imbalanced-class and SMOTE cases. Their performance in each case is reported in Table 6 (Appendix 1). For the imbalanced class scenario:

• XGBoost and Naive Bayes are the top performers: both achieve near-perfect AUC, TP, TN, and MCC scores.
• XGBoost: high precision and recall, low FN and FP.
• Naive Bayes: perfect AUC, precision, and recall; it captures all extreme events.
• Balancing with SMOTE improves AUC and the other metrics compared to class weighting.

Layers Feature Importance
The contribution of each layer of the forecasting model is provided in Table 7. Based on this result, some observations follow (a sketch of one possible way to compute such contributions is given below):

• CNN layers contribute similarly to the prediction in both the CNN_LSTM and ConvLSTM models.
• The LSTM and Dense layers in the CNN_LSTM have comparable importance, suggesting similar roles.
• The Dense layer in the ConvLSTM contributes slightly less than in the CNN_LSTM.

The ConvLSTM outperforms the other architecture thanks to its ability to capture complex data patterns, demonstrating the importance of layer choice and feature length in model performance. While increasing feature length can be computationally demanding, it can also significantly enhance model accuracy when well suited to the data and algorithm.
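The paper does not spell out how these layer contributions were computed; one plausible permutation-style sketch, offered purely as an assumption, shuffles a trained layer's weights, measures the resulting rise in validation error, and restores the originals:

```python
# Hypothetical permutation-style estimate of per-layer contribution:
# the larger the error increase when a layer's trained weights are
# shuffled, the more that layer matters to the prediction.
import numpy as np

def layer_importance(model, X_val, y_val, seed=0):
    rng = np.random.default_rng(seed)
    baseline = model.evaluate(X_val, y_val, verbose=0)
    scores = {}
    for layer in model.layers:
        weights = layer.get_weights()
        if not weights:               # skip parameter-free layers
            continue
        layer.set_weights(
            [rng.permutation(w.ravel()).reshape(w.shape) for w in weights])
        scores[layer.name] = model.evaluate(X_val, y_val, verbose=0) - baseline
        layer.set_weights(weights)    # restore trained weights
    return scores
```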

Lime
For each model, the LIME method was employed to understand each instance-based forecast. Given the large number of experiments, only the results of the CNN_LSTM for each scenario are presented in Table 8. The BCI and CCI weights have relatively low importance in the Naive Bayes model's predictions, suggesting a limited impact on the target variable (Table 9). While SMOTE increases their feature importance, it risks misinterpretation due to the use of synthetic data.

Findings
• Feature importance and model performance depend on the data and the choice of algorithm.
• ConvLSTM's greater feature length offers advantages for complex cases.
• Algorithm capability often outweighs class balancing for extreme event prediction.

Conclusion
Trustworthy predictions are crucial in machine learning, but understanding how algorithms work can be tricky. This study digs into the challenges of interpreting predictions for imbalanced data, especially for extreme events like financial crashes or natural disasters. Many assume imbalanced data means worse predictions, but balancing it can also create complications. This study shows that the most important factors for accurately predicting and detecting extreme events are the data itself and the chosen algorithm; balancing techniques have less impact and might even mislead the model. It is important to note that this study was not about predicting specific events; it was about building a framework for analyzing extreme events using machine learning. While the limited data means the results are not universally applicable, they offer valuable insights into tackling imbalanced data in extreme event prediction. This field requires domain-specific knowledge, so future research will focus on long-term forecasting to identify more factors that define extreme events, including their duration, which is a big challenge in itself.

Figure and Table Captions

Figure 1. Hybrid Forecasting Classification Design. Data preparation involved setting a threshold for extreme event detection, transforming the data for supervised learning, and splitting it into train/validation/test sets. Data are scaled using MinMaxScaler, and regressors are trained for short-term forecasting. Events are defined based on the predictions, and classification models are trained considering class imbalance (using class weight and SMOTE). Prediction probabilities, model performance metrics, and interpretability methods are then assessed.

Figure 3. Activity Diagram. There are no missing values and all the values are quite close. The time series plot depicts the absence of trend and the presence of stationarity, confirmed by the Augmented Dickey-Fuller test, whose p-value is below 0.05 for each variable (p_value_BCI = 5.237877118736921e-11, p_value_CCI = 0.016789104416045534). The dataset is therefore suitable for time series forecasting without further preprocessing beyond scaling.

Figure 6a. Values below Threshold. Figure 6b. Extreme Events Detected in the Dataset Following Pseudo code 1.

Figure 7a. Detected Events.

Figure 8a. Structure of the CNN_LSTM. Figure 8b. Structure of the ConvLSTM.

Figure 9a. RMSE by Model. Figure 9b. MAE by Model. Figure 9c. Accuracy by Model.

Figure 9a. Actual vs Prediction with CNN_LSTM.

Figure 11a. Extreme Events in Prediction of CNN_LSTM. Figures 11a and 11b visualize these variations.

Figure 11a. Layer Feature Importance of the CNN_LSTM.

Table 1. Description of the dataset; Figure 4 depicts its time series representation.

Table 4. Summary of the results.