Exploration and Analysis of Power Outages in the Continental United States
by Tom Hocquet and Julia Ma
Introduction
The topic of electricity is important to understand, as electricity is now essential to daily life. It powers machinery, electronics, public transportation systems, and more. Electricity also serves as a basis for satisfying fundamental human needs, such as food production, clean water, sanitation, education services, health care, and social services. When power outages occur, these fundamental human needs are put at risk.
In this project, we explored a data set that reports “major outages witnessed by different states in the continental U.S. during January 2000–July 2016” (source). We specifically want to examine which attributes are correlated with longer durations of major power outages in the U.S. Understanding this question will be crucial for figuring out how to minimize the impact of future power outages.
Dataset Introduction
The dataset we will be using provides information on the major power outages that occurred in the U.S. from January 2000 to July 2016. Stats-wise, our raw dataset contains 1535 rows, which represent the number of major power outage reports, and 56 columns, which represent the number of power outage properties that were recorded.
The following columns are relevant to our analysis:
Note: descriptions were taken from ScienceDirect
| Column Name | Description |
|---|---|
| YEAR | Indicates the year when the outage event occurred |
| MONTH | Indicates the month when the outage event occurred |
| U.S._STATE | Represents all the states in the continental U.S. |
| POSTAL.CODE | Represents the postal code of the U.S. states |
| NERC.REGION | The North American Electric Reliability Corporation (NERC) regions involved in the outage event |
| ANOMALY.LEVEL | This represents the oceanic El Niño/La Niña (ONI) index referring to the cold and warm episodes by season. It is estimated as a 3-month running mean of ERSST.v4 SST anomalies in the Niño 3.4 region (5°N to 5°S, 120–170°W) |
| CLIMATE.CATEGORY | This represents the climate episodes corresponding to the years. The categories ("Warm", "Cold" or "Normal" episodes of the climate) are based on a threshold of ±0.5 °C for the Oceanic Niño Index (ONI) |
| OUTAGE.START | This variable indicates the day of the year when the outage event started (as reported by the corresponding Utility in the region) (in Timedelta) |
| OUTAGE.RESTORATION.DATE | This variable indicates the day of the year when power was restored to all the customers (as reported by the corresponding Utility in the region) |
| OUTAGE.RESTORATION | This variable indicates the time of the day when power was restored to all the customers (as reported by the corresponding Utility in the region) (in TimeDelta) |
| CAUSE.CATEGORY | Categories of all the events causing the major power outages |
| CAUSE.CATEGORY.DETAIL | Detailed description of the event categories causing the major power outages |
| OUTAGE.DURATION | Duration of outage events (in minutes) |
| CUSTOMERS.AFFECTED | Number of customers affected by the power outage event |
| TOTAL.PRICE | Average monthly electricity price in the U.S. state (cents/kilowatt-hour) |
Cleaning and EDA
The data features 1535 rows and 55 columns, meaning there were 1,535 reported major power outages in the continental U.S. between January 2000 and July 2016. A lot of rows have missing values depending on what caused the power outage (for example, the `HURRICANE.NAMES` column is mostly filled with `np.nan`, as only 72 of the power outages were caused by hurricanes, which is about 4.6% of the data).
Dropping unnecessary columns
Some columns, such as the `OBS` and `variables` columns, did not contain any information on the actual properties of major power outages, so they were dropped. Some of the outages were also missing the times at which they started and ended. Since our central question revolves around outage duration and only 9 rows were missing these dates, we decided to drop those rows.
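A minimal sketch of this cleaning step, assuming the raw data has been loaded into a pandas DataFrame called `outages` (the variable name and the file name are placeholders of ours, not from the report):

```python
import pandas as pd

# Load the raw data (the file name here is only a placeholder).
outages = pd.read_excel("outage.xlsx")

# Drop columns that carry no information about the outages themselves.
outages = outages.drop(columns=["OBS", "variables"], errors="ignore")

# Drop the 9 rows that are missing their start or restoration dates,
# since OUTAGE.DURATION cannot be verified for them.
outages = outages.dropna(subset=["OUTAGE.START.DATE", "OUTAGE.RESTORATION.DATE"])
```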
Converting datatypes
Cleaning also included making sure each of the columns were the right type, and if they were not, updating them to their intended types. For instance, the YEAR
and MONTH
columns were originally floats, but through cleaning they were converted to ints.
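A small sketch of this conversion, assuming the rows with missing dates have already been dropped (otherwise the integer cast would fail on NaNs):

```python
# YEAR and MONTH were read in as floats; cast them to integers.
outages["YEAR"] = outages["YEAR"].astype(int)
outages["MONTH"] = outages["MONTH"].astype(int)
```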
Converting to correct numeric types
Certain columns, such as the ones regarding `.PRICE` or `PCT_`, needed to be converted from objects to floats.
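One way this conversion could look, using pandas' `to_numeric`; the list of columns is inferred from the column names here rather than taken from the report:

```python
# Coerce the price and percentage columns from object dtype to floats;
# any non-numeric entries become NaN.
numeric_cols = [c for c in outages.columns if ".PRICE" in c or c.startswith("PCT_")]
outages[numeric_cols] = outages[numeric_cols].apply(pd.to_numeric, errors="coerce")
```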
Below shows some of the changed data types of the newly cleaned dataframe:
| Column | dtype |
|---|---|
| YEAR | int64 |
| MONTH | int64 |
| U.S._STATE | object |
| POSTAL.CODE | object |
| RES.PRICE | float64 |
| COM.PRICE | float64 |
| IND.PRICE | float64 |
| TOTAL.PRICE | float64 |
| RES.SALES | float64 |
| PCT_LAND | float64 |
| PCT_WATER_TOT | float64 |
| PCT_WATER_INLAND | float64 |
| SQUARE.MILES.AFFECTED | float64 |
Below is the head of the DataFrame of the 5 most relevant columns:
| | YEAR | MONTH | U.S._STATE | OUTAGE.START.DATE | OUTAGE.DURATION |
|---|---|---|---|---|---|
| 0 | 2011 | 7 | Minnesota | 2011-07-01 00:00:00 | 3060 |
| 1 | 2014 | 5 | Minnesota | 2014-05-11 00:00:00 | 1 |
| 2 | 2010 | 10 | Minnesota | 2010-10-26 00:00:00 | 3000 |
| 3 | 2012 | 6 | Minnesota | 2012-06-19 00:00:00 | 2550 |
| 4 | 2015 | 7 | Minnesota | 2015-07-18 00:00:00 | 1740 |
Exploratory Data Analysis
Univariate Analysis
For the univariate analysis, we decided to look at the columns that were relevant to our research question. In this graph, we see the total number of outages based on their duration.
Note: each grouping of the histogram represents a 12-hour period.
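A sketch of how such a histogram could be produced; we assume plotly express here, though the report's actual plotting library is not stated:

```python
import plotly.express as px

# OUTAGE.DURATION is in minutes, so a 12-hour bin is 720 minutes wide.
fig = px.histogram(outages, x="OUTAGE.DURATION",
                   title="Distribution of Outage Durations")
fig.update_traces(xbins=dict(size=720))  # one bar per 12-hour period
fig.show()
```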
We also looked at whether outages in different months tend to have different durations. Here we graphed the mean duration of the outages based on the month they occurred in. We can see that outages in the summer last longer on average than those in the spring and fall.
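A sketch of the aggregation behind this plot, continuing from the earlier sketches (the plotting call is again an assumption):

```python
# Mean outage duration (in minutes) for each calendar month.
mean_by_month = (
    outages.groupby("MONTH")["OUTAGE.DURATION"].mean().reset_index()
)
fig = px.bar(mean_by_month, x="MONTH", y="OUTAGE.DURATION",
             title="Mean Outage Duration by Month")
fig.show()
```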
Bivariate Analysis
For the bivariate analysis, we looked once more at outage duration, this time in relation to location within the U.S. Below is a map in which each state is color-coded by its average outage duration.
Note that the bins are not evenly sized, in order to better represent the data.
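A sketch of how such a map could be built with plotly's USA choropleth support; note that the report bins the averages unevenly, while this sketch simply uses a continuous color scale:

```python
# Average outage duration per state, keyed by postal code so it can be
# drawn on a USA choropleth.
state_means = (
    outages.groupby("POSTAL.CODE")["OUTAGE.DURATION"].mean().reset_index()
)
fig = px.choropleth(
    state_means,
    locations="POSTAL.CODE",
    locationmode="USA-states",
    color="OUTAGE.DURATION",
    scope="usa",
    title="Average Outage Duration by State (minutes)",
)
fig.show()
```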
Interesting Aggregates
The following aggregate has `CAUSE.CATEGORY` as the index and the missingness of `CAUSE.CATEGORY.DETAIL` as the columns. The `DETAIL_MISSING = False` column gives the proportion of rows in each `CAUSE.CATEGORY` among outages where `CAUSE.CATEGORY.DETAIL` is not missing; the `DETAIL_MISSING = True` column gives the same proportions among outages where it is missing. Each column is a separate distribution that sums to 1. This aggregate is significant in investigating whether or not the missingness in `CAUSE.CATEGORY.DETAIL` depends on `CAUSE.CATEGORY`, which will be explored in the later section assessing the missingness in the data.
| CAUSE.CATEGORY | DETAIL_MISSING = False | DETAIL_MISSING = True |
|---|---|---|
| equipment failure | 0.0418288 | 0.0267857 |
| fuel supply emergency | 0.0243191 | 0.0290179 |
| intentional attack | 0.347276 | 0.102679 |
| islanding | nan | 0.0982143 |
| public appeal | nan | 0.154018 |
| severe weather | 0.551556 | 0.395089 |
| system operability disruption | 0.0350195 | 0.194196 |
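One way to reproduce this aggregate (a sketch; the report's own code may differ, and categories with zero counts would show up here as 0 rather than nan):

```python
# Whether the detail column is missing for each outage.
detail_missing = outages["CAUSE.CATEGORY.DETAIL"].isna()

# Distribution of CAUSE.CATEGORY separately for missing and non-missing
# detail; normalize="columns" makes each column sum to 1.
distributions = pd.crosstab(
    outages["CAUSE.CATEGORY"], detail_missing, normalize="columns"
)
distributions.columns = ["DETAIL_MISSING = False", "DETAIL_MISSING = True"]
```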
Assessment of Missingness
NMAR Analysis
One column that could have NMAR data is the `CUSTOMERS.AFFECTED` column. This column measures the number of customers affected by the power outage event and contains 420 NaNs. This missingness could be due to the number of affected customers never being reported. If the cause of the power outage was minor, then there would be less damage done; customers would feel less affected and would be less inclined to report it, explaining the NaNs.
If we wanted to explain the missingness, information on how the number of affected customers was determined (e.g., through a survey or a head count of the area) would be needed. Additional information on the type and severity of the power outage's cause could be correlated with whether the number of customers affected was reported, which would make the missingness MAR.
Missingness Dependency
Picking a column with non-trivial missingness
To find a column with non-trivial missingness, we must first define what non-trivial missingness is. In this project, we defined a column to have non-trivial missingness if 20% or more of its values are missing. The following dataframe shows the top 5 columns with the greatest proportion of missing values:
| Column Name | Proportion of Missing Values |
|---|---|
| HURRICANE.NAMES | 0.99 |
| DEMAND.LOSS.MW | 0.48 |
| CAUSE.CATEGORY.DETAIL | 0.32 |
| CUSTOMERS.AFFECTED | 0.3 |
| OUTAGE.RESTORATION.DATE | 0.04 |
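The proportions above can be computed directly from the cleaned frame; a quick sketch:

```python
# Proportion of missing values in each column, largest first.
missing_props = outages.isna().mean().sort_values(ascending=False)
print(missing_props.head(5).round(2))
```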
After analyzing the table, the column we decided to base our missingness analysis on was the CAUSE.CATEGORY.DETAIL
column, which has non-trivial missingness of approximately 32%.
Next, we will use permutation tests to analyze whether the missingness in the `CAUSE.CATEGORY.DETAIL` column depends on each of the following columns: `CAUSE.CATEGORY` and `IND.PRICE`.
CAUSE.CATEGORY.DETAIL and CAUSE.CATEGORY (MAR)
Null hypothesis: The missingness in CAUSE.CATEGORY.DETAIL
does not depend on CAUSE.CATEGORY
Alternative hypothesis: The missingness in CAUSE.CATEGORY.DETAIL
does depend on CAUSE.CATEGORY
Observed test-statistic: Total Variation Distance (TVD)
To recall, `CAUSE.CATEGORY` contains categories of all the events causing the major power outages, and `CAUSE.CATEGORY.DETAIL` contains a detailed description of the event categories causing the major power outages.
The vertical barplot below displays the distributions of CAUSE.CATEGORY
with and without the missingness in CAUSE.CATEGORY.DETAIL
. Analyzing the barplot, we notice the distributions are very different.
Next, we continue to conduct a permutation test using the Total Variation Distance (TVD) as our observed test statistic.
After 500 permutations of shuffling the `CAUSE.CATEGORY` column and simulating the TVD, the p-value comes out to be 0.0. Our p-value of 0.0 is less than our significance level of 0.01, therefore we reject the null hypothesis stating that the missingness in `CAUSE.CATEGORY.DETAIL` does not depend on `CAUSE.CATEGORY`, thus making the missingness MAR.
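A hedged sketch of this permutation test; the helper names (`tvd`, `tvd_for`) are ours, and we assume the cleaned frame `outages` from the earlier sketches:

```python
import numpy as np

def tvd(dist1, dist2):
    """Total variation distance between two categorical distributions."""
    return np.abs(dist1 - dist2).sum() / 2

detail_missing = outages["CAUSE.CATEGORY.DETAIL"].isna().to_numpy()

def tvd_for(categories):
    """TVD between the CAUSE.CATEGORY distributions of the missing and
    non-missing CAUSE.CATEGORY.DETAIL groups."""
    dists = pd.crosstab(categories, detail_missing, normalize="columns")
    return tvd(dists[True], dists[False])

observed_tvd = tvd_for(outages["CAUSE.CATEGORY"].to_numpy())

# Shuffle CAUSE.CATEGORY 500 times and recompute the TVD each time.
simulated = np.array([
    tvd_for(np.random.permutation(outages["CAUSE.CATEGORY"].to_numpy()))
    for _ in range(500)
])

p_value = (simulated >= observed_tvd).mean()
```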
CAUSE.CATEGORY.DETAIL and IND.PRICE (MCAR)
Null hypothesis: The missingness in CAUSE.CATEGORY.DETAIL
does not depend on IND.PRICE
Alternative hypothesis: The missingness in CAUSE.CATEGORY.DETAIL
does depend on IND.PRICE
Observed test-statistic: Total Variation Distance (TVD)
To recall, `IND.PRICE` contains the monthly electricity price in the industrial sector (cents/kilowatt-hour), and `CAUSE.CATEGORY.DETAIL` contains a detailed description of the event categories causing the major power outages.
Note: since `IND.PRICE` contains numerical data, we decided to bin the prices into 5 quantile-based categories to make it categorical, and thus be able to use the Total Variation Distance in our permutation test.
| QcutBin | DETAIL_MISSING = False | DETAIL_MISSING = True |
|---|---|---|
| (3.199, 5.48] | 0.224708 | 0.162946 |
| (5.48, 6.3] | 0.197471 | 0.1875 |
| (6.3, 7.32] | 0.190661 | 0.212054 |
| (7.32, 9.114] | 0.189689 | 0.209821 |
| (9.114, 27.85] | 0.191634 | 0.214286 |
| nan | 0.00583658 | 0.0133929 |
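A sketch of the binning and of the resulting distribution table, continuing from the earlier sketches (the `QcutBin` column name matches the table above; rows with a missing price are dropped by `crosstab` here, whereas the table keeps them as a separate nan bin):

```python
# Split IND.PRICE into 5 equal-sized quantile bins.
outages["QcutBin"] = pd.qcut(outages["IND.PRICE"], q=5)

# Distribution of the bins, separately for missing vs. non-missing
# CAUSE.CATEGORY.DETAIL; each column sums to (roughly) 1.
price_dists = pd.crosstab(
    outages["QcutBin"],
    outages["CAUSE.CATEGORY.DETAIL"].isna(),
    normalize="columns",
)
```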
The vertical barplot below displays the distributions of IND.PRICE
with and without the missingness in CAUSE.CATEGORY.DETAIL
. Analyzing the barplot, we notice the distributions are more similar.
Next, we continue to conduct a permutation test using the Total Variation Distance (TVD) as our observed test statistic.
After 500 permutations of shuffling the `IND.PRICE` column and simulating the TVD, the p-value comes out to be 1.0. Our p-value of 1.0 is greater than our significance level of 0.01, therefore we fail to reject the null hypothesis stating that the missingness in `CAUSE.CATEGORY.DETAIL` does not depend on `IND.PRICE`, which is consistent with the missingness being MCAR with respect to this column.
Hypothesis Testing
For our hypothesis test, we chose two variables that appear to be correlated and are relevant to our question, in order to test whether that correlation is real. We looked to see if any of our columns had a graphical correlation with `OUTAGE.DURATION`. After many, many graphs, all of which seemed to show very little correlation, we started deriving new statistics from the columns we have. One such statistic is `SQUARE.MILES.AFFECTED`, made by dividing `CUSTOMERS.AFFECTED` by `POPDEN_UC` (the population density of urban clusters).
Here is a graph of the two variables, with the line of best fit drawn on it as well.
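A sketch of how the derived statistic and the scatter plot could be produced (the variable names are ours; `trendline="ols"` additionally requires the statsmodels package):

```python
# Keep only rows where all three quantities are available.
hyp = outages.dropna(
    subset=["CUSTOMERS.AFFECTED", "POPDEN_UC", "OUTAGE.DURATION"]
).copy()

# Rough "area affected": customers divided by persons per square mile.
hyp["SQUARE.MILES.AFFECTED"] = hyp["CUSTOMERS.AFFECTED"] / hyp["POPDEN_UC"]

fig = px.scatter(
    hyp,
    x="SQUARE.MILES.AFFECTED",
    y="OUTAGE.DURATION",
    trendline="ols",  # line of best fit
    title="Outage Duration vs. Square Miles Affected",
)
fig.show()
```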
From this we constructed the following:
Null Hypothesis (H0): There is no correlation between `OUTAGE.DURATION` and `SQUARE.MILES.AFFECTED`.
Alternative Hypothesis (HA): There is a correlation between `OUTAGE.DURATION` and `SQUARE.MILES.AFFECTED`.
To test this, we used the Pearson correlation coefficient, which measures the linear relationship between two numerical variables. We set alpha equal to 0.01.
```python
p_value = (np.abs(permuted_correlations) >= np.abs(observed_correlation)).mean()
```
This returned a p-value of 1e-05
Permutation Test
We ran the permutation test 100,000 times. The blue histogram is the simulated null distribution and the red line is the observed value.
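A sketch of the full permutation test that produces the p-value quoted above, reusing the `hyp` frame from the previous sketch:

```python
import numpy as np

observed_correlation = hyp["OUTAGE.DURATION"].corr(hyp["SQUARE.MILES.AFFECTED"])

durations = hyp["OUTAGE.DURATION"].to_numpy()
areas = hyp["SQUARE.MILES.AFFECTED"].to_numpy()

# Null distribution: shuffle one variable 100,000 times and recompute
# the Pearson correlation each time.
permuted_correlations = np.array([
    np.corrcoef(np.random.permutation(durations), areas)[0, 1]
    for _ in range(100_000)
])

p_value = (np.abs(permuted_correlations) >= np.abs(observed_correlation)).mean()
```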
Hypothesis Testing Conclusion
The p-value for this hypothesis test is approximately 0.000 (1e-05), which is less than our alpha of 0.01. This means that we can reject the null hypothesis.
This could be explained by a multitude of factors and unknown variables, so we cannot draw any conclusion other than that the two variables are correlated.