Exploration and Analysis of Power Outages in the Continental United States

by Tom Hocquet and Julia Ma

GIF

Introduction

The topic of electricity is important to understand, as it is something that is now essential to daily life. Electricity is used to maintain machinery, electronics, public transportation systems, etc. Electricity also serves as a basis for satisfying fundamental human needs, such as food production, clean water, sanitation, education services, health care, and social services. When power outages occur, these fundamental human needs become at risk.

In this project, we explored a data set that reports “major outages witnessed by different states in the continental U.S. during January 2000–July 2016.” (source.) We specifically want to examine what attributes are correlated with longer duration of major power outages in the U.S.? Understanding this question will be crucial in finding out how to minimize the number of future power outages.

Dataset Introduction

The dataset we will be using provides information on the major power outages that occured in the U.S. from January 2000 to July 2016. Stats-wise, our raw dataset contains 1535 rows, which represent the nunber of major power outage reports, and 56 columns, which represent the number of power outage properties that were recorded.

The following columns will be revelant to our dataset:

Note: descriptions were taken from ScienceDirect

Column Names	Description
`YEAR`	Indicates the year when the outage event occurred
`MONTH`	Indicates the month when the outage event occurred
`U.S._STATE`	Represents all the states in the continental U.S.
`POSTAL.CODE`	Represents the postal code of the U.S. states
`NERC.REGION`	The North American Electric Reliability Corporation (NERC) regions involved in the outage event
`ANOMALY.LEVEL`	This represents the oceanic El Niño/La Niña (ONI) index referring to the cold and warm episodes by season. It is estimated as a 3-month running mean of ERSST.v4 SST anomalies in the Niño 3.4 region (5°N to 5°S, 120–170°W)
`CLIMATE.CATEGORY`	This represents the climate episodes corresponding to the years. The categories—“Warm”, “Cold” or “Normal” episodes of the climate are based on a threshold of ± 0.5 °C for the Oceanic Niño Index (ONI)
`OUTAGE.START`	This variable indicates the day of the year when the outage event started (as reported by the corresponding Utility in the region) (in Timedelta)
`OUTAGE.RESTORATION.DATE`	This variable indicates the day of the year when power was restored to all the customers (as reported by the corresponding Utility in the region)
`OUTAGE.RESTORATION`	This variable indicates the time of the day when power was restored to all the customers (as reported by the corresponding Utility in the region) (in TimeDelta)
`CAUSE.CATEGORY`	Categories of all the events causing the major power outages
`CAUSE.CATEGORY.DETAIL`	Detailed description of the event categories causing the major power outages
`OUTAGE.DURATION`	Duration of outage events (in minutes)
`CUSTOMERS.AFFECTED`	Number of customers affected by the power outage event
`TOTAL.PRICE`	Average monthly electricity price in the U.S. state (cents/kilowatt-hour)

Cleaning and EDA

The data features 1535 rows and 55 columns. That means that their were 1,534 power outages in the continental U.S. in between the dates of January 2000 and July 2016. A lot of rows have some missing values depending what caused the power outages (for example the column HURRICANE.NAMES is mostly filled with np.nan as only 72 of the power outages were caused by hurricanes which is about ≈4.6% of the data).

Dropping unneccessary columns

Some columns, such as the OBS and variables column, did not contain any information on the actual properties of major power outages, so they were dropped as a result. Some of the outages were also missing the time in which they started and ended. Due to our central question being around the outage duration and their only being 9 rows with missing dates, we decided to drop those rows.

Converting datatypes

Cleaning also included making sure each of the columns were the right type, and if they were not, updating them to their intended types. For instance, the YEAR and MONTH columns were originally floats, but through cleaning they were converted to ints.

Converting to correct datetimes

Certain columns such as the ones regarding .PRICE or PCT_ needed to be converted from objects to floats

Below shows some of the changed data types of the newly cleaned dataframe:

	0
YEAR	int64
MONTH	int64
U.S._STATE	object
POSTAL.CODE	object
RES.PRICE	float64
COM.PRICE	float64
IND.PRICE	float64
TOTAL.PRICE	float64
RES.SALES	float64
PCT_LAND	float64
PCT_WATER_TOT	float64
PCT_WATER_INLAND	float64
SQUARE.MILES.AFFECTED	float64

Below is the head of the DataFrame of the 5 most relevant columns:

	YEAR	MONTH	U.S._STATE	OUTAGE.START.DATE	OUTAGE.DURATION
0	2011	7	Minnesota	2011-07-01 00:00:00	3060
1	2014	5	Minnesota	2014-05-11 00:00:00	1
2	2010	10	Minnesota	2010-10-26 00:00:00	3000
3	2012	6	Minnesota	2012-06-19 00:00:00	2550
4	2015	7	Minnesota	2015-07-18 00:00:00	1740

Exploratory Data Analysis

Univariate Analysis

For the univariate analysis, we decided to look at the columns that were relevent to our research question. In this graph, we see the total number of outages based on their duration.

Note: each grouping of the histogram represents a 12 hour period.

We also looked at whether in different months we see outage of different durations. Here we graphed the mean duration of the outages based on the month they occured in. We can see that outages in the summer last longer on average than the spring and the fall.

Bivariate Analysis

For the bivariate analysis, we looked once more at the outage duration but in relation to their location in the U.S. Below is a map that shows each of the states color coded depending on the average duration.

Note that the bins are not even to better represent the data.

Interesting Aggregates

The following aggregate has CAUSE.CATEGORY as the index and CAUSE.CATEGORY.DETAIL as the columns. DETAIL_MISSING = False represents how many NaNs were proportionality not missing in each CAUSE.CATEGORY. DETAIL_MISSING = False represents how many NaNs were proportionality missing in each CAUSE.CATEGORY. Each column is a separate distribution that adds to 1. This aggregate is significant in investigating whether or not the missingness inCAUSE.CATEGORY.DETAIL depends on CAUSE.CATEGORY, which will be explored in the later section assessing the missingness in the data.

CAUSE.CATEGORY	DETAIL_MISSING = False	DETAIL_MISSING = True
equipment failure	0.0418288	0.0267857
fuel supply emergency	0.0243191	0.0290179
intentional attack	0.347276	0.102679
islanding	nan	0.0982143
public appeal	nan	0.154018
severe weather	0.551556	0.395089
system operability disruption	0.0350195	0.194196

Assessment of Missingness

NMAR Analysis

One column that could have NMAR data is the “CUSTOMERS.AFFECTED” column. This column measures the number of customers affected by the power outage event and contains 420 NaNs. This missingness could be due to customers not reporting their affectedness by the power outage event. If the severity of power outage cause was small, then there would be less damage done and thus customers would feel less affected and less inclined to report on their affectedness, explaining the NaNs.

If we wanted to explain the missingness, information on how the customers were affected (i.e. through a survey or head count of the area) would be needed. Additional information on the type and severity of the power outage cause could be correlated with how the customers affected was reported, thus making the missingness MAR.

Missingness Dependency

Picking a column with non-trivial missingness

To find a column with non-trivial missingness, we must first define what non-trivial missingness is. In this project, we defined a column to have non-trivial missingness as having 20% or more missing data values in that respective column. The following dataframe shows the top 5 columns that have a the greatest proportion of missing values:

column name	missing data amount
HURRICANE.NAMES	0.99
DEMAND.LOSS.MW	0.48
CAUSE.CATEGORY.DETAIL	0.32
CUSTOMERS.AFFECTED	0.3
OUTAGE.RESTORATION.DATE	0.04

After analyzing the table, the column we decided to base our missingness analysis on was the CAUSE.CATEGORY.DETAIL column, which has non-trivial missingness of approximately 32%.

Next, we will use permutation tests to analyze the dependency of the missingness in the CAUSE.CATEGORY.DETAIL column against the following columns: CAUSE.CATEGORY and POPULATION

CAUSE.CATEGORY.DETAIL and CAUSE.CATEGORY (MAR)

Null hypothesis: The missingness in CAUSE.CATEGORY.DETAIL does not depend on CAUSE.CATEGORY

Alternative hypothesis: The missingness in CAUSE.CATEGORY.DETAIL does depend on CAUSE.CATEGORY

Observed test-statistic: Total Variation Distance (TVD)

To recall, CAUSE.CATEGORY contains categories of all the events causing the major power outages, and CAUSE.CATEGORY.DETAIL contains detailed description of the event categories causing the major power outages.

The vertical barplot below displays the distributions of CAUSE.CATEGORY with and without the missingness in CAUSE.CATEGORY.DETAIL. Analyzing the barplot, we notice the distributions are very different.

Next, we continue to conduct a permutation test using the Total Variation Distance (TVD) as our observed test statistic.

After 500 permutations of shuffling the CAUSE.CATEGORY column and simulating the TVD results, the p-value comes to be 0.0. Our p-value of 0.0 is less than our significance level of 0.01, therefore we reject the null hypthesis stating that the missingness in CAUSE.CATEGORY.DETAIL does not depend on CAUSE.CATEGORY, thus making it MAR.

CAUSE.CATEGORY.DETAIL and IND.PRICE (MCAR)

Null hypothesis: The missingness in CAUSE.CATEGORY.DETAIL does not depend on IND.PRICE

Alternative hypothesis: The missingness in CAUSE.CATEGORY.DETAIL does depend on IND.PRICE

Observed test-statistic: Total Variation Distance (TVD)

To recall, IND.PRICE contains the monthly electricity price in the industrial sector (cents/kilowatt-hour), and CAUSE.CATEGORY.DETAIL contains detailed description of the event categories causing the major power outages.

*Note: Since IND.PRICE contains numerical data, we decided to bin the prices into 5 categories to make it categorical, and thus be able to use the Total Variation Distance to run our permutation tests.

QcutBin	DETAIL_MISSING = False	DETAIL_MISSING = True
(3.1990000000000003, 5.48]	0.224708	0.162946
(5.48, 6.3]	0.197471	0.1875
(6.3, 7.32]	0.190661	0.212054
(7.32, 9.114]	0.189689	0.209821
(9.114, 27.85]	0.191634	0.214286
nan	0.00583658	0.0133929

The vertical barplot below displays the distributions of IND.PRICE with and without the missingness in CAUSE.CATEGORY.DETAIL. Analyzing the barplot, we notice the distributions are more similar.

Next, we continue to conduct a permutation test using the Total Variation Distance (TVD) as our observed test statistic.

After 500 permutations of shuffling the IND.PRICE column and simulating the TVD results, the p-value comes to be 1.0. Our p-value of 1.0 is greater than our significance level of 0.01, therefore we fail to reject the null hypthesis stating that the missingness in CAUSE.CATEGORY.DETAIL does not depend on IND.PRICE, thus making it MCAR.

Hypothesis Testing

–

For our hypothesis testing we have chosen two variables that appear to have a correlation with out question to test whether that is true or not. We looked to see if any of our columns had a graphical correlation with OUTAGE.DURATION. After many, many graphs, which all seemed to have very little correlation, we started to make different statistics from the columns we have. One such statistic we made is SQUARE.MILES.AFFECTED, made from dividing CUSTOMERS.AFFECTED by POPDEN_UC

Here is a graph of the two variables, with the line of best fit graphed on it as well.

From this we constructed the following:

Null Hypothesis (H0): There is no correlation between the two variables.

Alternative Hypothesis (HA): There is a correlation between the two variables.

To test for this, we used the Pearson correlation coefficient to measure our hypothesis. This test the linear relationsip between two numerical variables. We set the alpha to equal 0.01

p_value = (np.abs(permuted_correlations) >= np.abs(observed_correlation)).mean()

This returned a p-value of 1e-05

Permutation Test

We ran the permutation test 100,000 times. The blue stack is our distribution and the red line is the observed value.

Hypothesis Testing Conclusion

The P-Value for this hypothesis test is of 0.000, which is less than our alpha of 0.01. This means that we can reject the null hypothesis.

This could be explained by a mutlitude of factors and unknown variables, so we cannot draw any conclusion other that the variables are correlated.

Power Outage Model

Framing the Problem

Introduction

Electricity is essential for daily life, as it is used to maintain many modern-day necessities such as machinery, electronics, public transportation systems, etc. It serves as a basis for food production, clean water, sanitation, education services, health care, social services and other fundamental human needs. When power outages occur, these fundamental human needs become at risk and poses a great issue.

In this project, we use this dataset that reports on all the major power outages in the U.S. from 2000 to 2016. Specifically, we want to use the dataset to create a classification model that will aid us in predicting power outages, specifically predicting the cause type of the power outage.

Classification Model

We want to attempt to predict the CAUSE.CATEGORY using other features found in the dataset (more information on the specific features will be revealed in later parts). Since we are predicting discrete, categorical variables in the CAUSE.CATEGORY column, we will be using classification to model our predictions.

Multiclass Classification

We will be performing a multiclass classification, since the observations in CAUSE.CATEGORY can be classified into 7 different categories (as shown in the value counts series below).

Response Variable

The response variable we chose to predict is CAUSE.CATEGORY, because it had the most complete data to work with, with 0 missing values. We were also more interested in using classification rather than regression.

Model Metric

The F1-score is the metric we used to evaluate our model. Since the CAUSE.CATEGORY column is less balanced and more skewed (as shown in the value counts below), the F1-score is more suitable to evaluate the potential false postives and negatives. The F1-score also considered precision and recall

Data Cleaning

Converting OUTAGE.START and OUTAGE.RESTORATION.DATE

OUTAGE.START.DATE, OUTAGE.START.TIME, OUTAGE.RESTORATION.DATE, OUTAGE.RESTORATION.TIME are all datetime columns. In order to incorporate them into our model, we decided to convert OUTAGE.START.TIME and OUTAGE.RESTORATION.TIME to only include the hour the power outage started and was restored. These are now int columns, and the columns were respectively renamed to START.HOUR and END.HOUR. We also converted OUTAGE.START.DATE and OUTAGE.RESTORATION.DATE to only include the day of the month the power outage started and was restored. These are also now int columns, and the columns were respectively renamed to START.DAY and END.DAY. The new columns are shown below:

	START.DAY	START.HOUR	END.DAY	END.HOUR
0	01	17	03	20
1	11	18	11	18
2	26	20	28	22
3	19	04	20	23
4	18	02	19	07

Filtering Features

Since we can’t choose features that we wouldn’t know during the “time of prediction”, which in our case means during the time the outage’s CAUSE.CATEGORY was reported, we must do additional data cleaning to take out those features from the dataset we’ll use to train our models.

The following features cannot be used with the following reasons:

OBS: This column is redundant to the index column.
POSTAL.CODE: Since we’ve decided to include the U.S._STATE, POSTAL.CODE is redundant.
CAUSE.CATEGORY.DETAIL: This feature can only be recorded after reporting the CAUSE.CATEGORY.
HURRICANE.NAMES: This feature can only be recorded after reporting the CAUSE.CATEGORY.

Filtering Column NaNs

In the dataset, we decided to convert the NaNs in every column except CLIMATE.REGION, ANOMALY.LEVEL, and CLIMATE.CATEGORY to -1 to analyze if the missingness in some columns will affect the outcome of our classiifcation model. We decided to use the value of -1 to fill in the NaNs, since every column except CLIMATE.REGION, ANOMALY.LEVEL, and CLIMATE.CATEGORY do not contain negative values, since it wouldn’t make sense in the context of the columns.

Finally, we removed the rows that contained NaNs from CLIMATE.REGION, ANOMALY.LEVEL, and CLIMATE.CATEGORY columns. This removed 14 rows, which is less than 1% of our total rows. Our dataset has no NaNs now!

Final Cleaned Dataset

Below shows the head of the dataset:

	YEAR	MONTH	U.S._STATE	NERC.REGION	CLIMATE.REGION	ANOMALY.LEVEL	CLIMATE.CATEGORY	CAUSE.CATEGORY	OUTAGE.DURATION	DEMAND.LOSS.MW	CUSTOMERS.AFFECTED	RES.PRICE	COM.PRICE	IND.PRICE	TOTAL.PRICE	RES.SALES	COM.SALES	IND.SALES	TOTAL.SALES	RES.PERCEN	COM.PERCEN	IND.PERCEN	RES.CUSTOMERS	COM.CUSTOMERS	IND.CUSTOMERS	TOTAL.CUSTOMERS	RES.CUST.PCT	COM.CUST.PCT	IND.CUST.PCT	PC.REALGSP.STATE	PC.REALGSP.USA	PC.REALGSP.REL	PC.REALGSP.CHANGE	UTIL.REALGSP	TOTAL.REALGSP	UTIL.CONTRI	PI.UTIL.OFUSA	POPULATION	POPPCT_URBAN	POPPCT_UC	POPDEN_URBAN	POPDEN_UC	POPDEN_RURAL	AREAPCT_URBAN	AREAPCT_UC	PCT_LAND	PCT_WATER_TOT	PCT_WATER_INLAND	START.DAY	START.HOUR	END.DAY	END.HOUR
0	2011	7	Minnesota	MRO	East North Central	-0.3	normal	severe weather	3060	-1	70000	11.6	9.18	6.81	9.28	2332915	2114774	2113291	6562520	35.5491	32.225	32.2024	2.30874e+06	276286	10673	2.5957e+06	88.9448	10.644	0.411181	51268	47586	1.07738	1.6	4802	274182	1.75139	2.2	5.34812e+06	73.27	15.28	2279	1700.5	18.2	2.14	0.6	91.5927	8.40733	5.47874	01	17	03	20
1	2014	5	Minnesota	MRO	East North Central	-0.1	normal	intentional attack	1	-1	-1	12.12	9.71	6.49	9.28	1586986	1807756	1887927	5284231	30.0325	34.2104	35.7276	2.34586e+06	284978	9898	2.64074e+06	88.8335	10.7916	0.37482	53499	49091	1.08979	1.9	5226	291955	1.79	2.2	5.45712e+06	73.27	15.28	2279	1700.5	18.2	2.14	0.6	91.5927	8.40733	5.47874	11	18	11	18
2	2010	10	Minnesota	MRO	East North Central	-1.5	cold	severe weather	3000	-1	70000	10.87	8.19	6.07	8.15	1467293	1801683	1951295	5222116	28.0977	34.501	37.366	2.30029e+06	276463	10150	2.58690e+06	88.9206	10.687	0.392361	50447	47287	1.06683	2.7	4571	267895	1.70627	2.1	5.3109e+06	73.27	15.28	2279	1700.5	18.2	2.14	0.6	91.5927	8.40733	5.47874	26	20	28	22
3	2012	6	Minnesota	MRO	East North Central	-0.1	normal	severe weather	2550	-1	68200	11.79	9.25	6.71	9.19	1851519	1941174	1993026	5787064	31.9941	33.5433	34.4393	2.31734e+06	278466	11010	2.60681e+06	88.8954	10.6822	0.422355	51598	48156	1.07148	0.6	5364	277627	1.93209	2.2	5.38044e+06	73.27	15.28	2279	1700.5	18.2	2.14	0.6	91.5927	8.40733	5.47874	19	04	20	23
4	2015	7	Minnesota	MRO	East North Central	1.2	warm	severe weather	1740	250	250000	13.07	10.16	7.74	10.43	2028875	2161612	1777937	5970339	33.9826	36.2059	29.7795	2.37467e+06	289044	9812	2.67353e+06	88.8216	10.8113	0.367005	54431	49844	1.09203	1.7	4873	292023	1.6687	2.2	5.48959e+06	73.27	15.28	2279	1700.5	18.2	2.14	0.6	91.5927	8.40733	5.47874	18	02	19	07

Baseline Model

Model Description

Our baseline model uses K-Nearest Neighbors (KNN) algorithm to analyze the following features on CAUSE.CATEGORY: OUTAGE.DURATION, YEAR, MONTH, U.S._STATE, TOTAL.PRICE, TOTAL.SALES, TOTAL.CUSTOMERS, AREAPCT_URBAN.

Feature Descriptions

Here are the descriptions for each feature (as taken from the dataset website):

OUTAGE.DURATION: Duration of outage events (in minutes)
YEAR: Indicates the year when the outage event occurred
MONTH: Indicates the month when the outage event occurred
U.S._STATE: Represents all the states in the continental U.S.
TOTAL.PRICE: Average monthly electricity price in the U.S. state (cents/kilowatt-hour)
TOTAL.SALES: Total electricity consumption in the U.S. state (megawatt-hour)
TOTAL.CUSTOMERS: Annual number of total customers served in the U.S. state
AREAPCT_URBAN: Percentage of the land area of the U.S. state represented by the land area of the urban areas (in %)

Here are the labels for the data found in each feature:

Quantitative: OUTAGE.DURATION, TOTAL.PRICE, TOTAL.SALES, TOTAL.CUSTOMERS, AREAPCT_URBAN
Nominal: U.S._STATE
Ordinal: YEAR, MONTH

*Note: Although YEAR and MONTH can argue to be either quantative or categorical, in the context of our classifier model, we are treating YEAR and MONTH as categorical variables, therefore they are considered to be categorical and ordinal.

Overall, there are 5 quantitative variables, 1 nominal variable, and 2 ordinal variables.

Feature Transformations

For this model, we One-Hot Encoded the categorical features U.S._STATE, YEAR, and MONTH to convert them to numerical form using OneHotEncoder(). One-Hot Encoding is when we convert each unique value in the categorical feature into a binary, with 1 representing the feature is present and 0 representing the feature is not.

We also used standardized the quantiative feature TOTAL.SALES using StandardScaler(). This standardizes the total electricity consumption reported in TOTAL.SALES by using the z-score (a.k.a. removing the mean and scaling to a unit variance). We decided to standarize because we believe the distribution of TOTAL.SALES is not normal.

Model Performance

To assess our KNN model performance, we decided to use a Confusion Matrix to analyze the amount of false positives and false negatives our model got. Overall, our model had a 66.32% training accuracy and a 63.95% testing accuracy.

Based on our training and testing accuracy, our current model is not that good. Both the training and testing accuracies average around 63 to 66%, which means our model uses the given features to predict the correct CAUSE.CATEGORY only a bit more than half the time. We decided that a model with 80% accuracy or above is considered to be good, so our current model falls below the threshold and is not considered adequate enough.

To describe more on the Confusion Matrix shown below, our Confusion Matrix reported our model to have a precision score of 0.638 and a recall score of 0.639. There are 243 total true positives (TP) and 108 false positives (FP).

Final Model

Model Description

For our final model, we chose to use the RandomForestClassifier. In order to improve our accuracy, we want to fine-tune the hyperparameters in our model with much detail, and we found the RandomForestClassifier does the best in allowing for a lot of hyperparameters to be tuned (as compared to a regular linear regression). Since we are trying to predict the categorical feature CAUSE.CATEGORY, this was another reason why we decided to use the classification method that is RandomForestClassifier.

We used a hyperparameter search, and it resulted in a depth of 15 and in the grid search we also found that using entropy gave us better results.

Below is a graph of F1 train error and F1 test error based on max_depth. We picked a max_depth of 15 as it has the lowest validation error.

Feature Descriptions

Here are the descriptions and reasoning for each feature (descriptions taken from the dataset website):

CUSTOMERS.AFFECTED: Represents the number of customers affected by the power outage event. If the number of customers affected is too high or too low, this could provide insight into what type of power outage causes that specific amount of customers to be affected.
TOTAL.REALGSP: Represents the real gross state product (GSP) contributed by all industries (total) (measured in 2009 chained U.S. dollars). If we know how much GSP each industry contributes, coupled with knowing where those industries are located in the U.S., this knowledge could influence certain non weather-based cause categories to happen more or less (i.e. intentional attack, equipment failure).
TOTAL.PRICE: Represents the average monthly electricity price in the U.S. state (cents/kilowatt-hour). If the electricity price is lower or higher, then non weather-based power outages (i.e. intentional attack, equipment failure) could be influenced.
ANOMALY.LEVEL: Represents the oceanic El Niño/La Niña (ONI) index referring to the cold and warm episodes by season. El Nino and La Nina are extreme weather phenomenons that happen around every 3 years. Knowing when they occur can influence what specific weather-based causes of power outages happen.
OUTAGE.DURATION: Represents the duration of outage events (in minutes). If the duration of the outage is longer or shorter, then that can influence the type of power outage cause. For instance, if a duration is shorter, then it’s probably not associated with an extreme-weather based cause.
RES.PERCEN: Represents the percentage of residential electricity consumption compared to the total electricity consumption in the state (in %). We wanted to not only focus on the generla electricity consumption, but were rather curious what the residential percentage of electricity consumption can influence the causes. The amount of residential electricity consumption can influence non weather-based causes.
MONTH: Represents the month when the outage event occurred. Certain power outage causes, especially those that are weather-based, can be correlated with the month the power outage was in, because the month can tell us the whether at the time of the month too.

Feature Transformations

Here are the feature transformations used:

One-Hot Encode: U.S._STATE, CLIMATE.CATEGORY
Standard Scalar: TOTAL.PRICE, OUTAGE.DURATION

Model Performance

Overall, our final model performed much better than our baseline model, with a new training accuracy of 100% (which is 1 - our training error of 0.0) and a new testing accuracy of 81.84% (which is 1 - our testing error of 0.18157894736842106). Our model metric, F1-score, reported to have a training error of 0.0 and testing error of 0.156349037818392. This is already a drastic improvement compared to our baseline model’s testing and training accuracies, which landed in the 60% range. However something to take note is that our training accuracy is 100% and our F1-score training error is 0%, which could indicate that we overfit the model.

Fairness Analysis

For our fairness analysis we decided to go with F1 score as it incoporates the fact that our data is uneavenly distributed. We decided to use CUSTOMERS.AFFECTED as the column to split into two groups. We split our columns into two groups, group 1 being less than 10,000 people affected and group 2 being more than 10,000. The null hypothesis is that our model is fair and the f1 score for both groups of CUSTOMER.AFFECTED is the same. The alternate hypothesis is that our model is unfair and the f1 score for both groups of CUSTOMER.AFFECTED is greater for group 2. We picked a significance level of 0.01. We ran the permutation 1000 times and it resulted in a p-value of 0.713. Hence we fail to reject the null hypothesis. We cannot say for sure that our test is fair as the results could be due to random chance. Below is a graph to display our distribution.

(Note: the graph behaved weirdly as the distribution of values is very small and very close to 0)