A Forecast of U.S. Solar Energy Generation

David Noonan

2017-11-15

Made with R Markdown. See the code on my GitHub page.


This is a time series analysis of total solar electricity generation in the U.S., based on data from the U.S. Energy Information Administration. The data contains monthly totals from 1973 to 2017 by energy source, in units of quadrillion British thermal units (BTU). I converted these to gigawatt-hours, a more common unit for electricity.
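The conversion is just a constant factor. Here is a minimal sketch of how it could be done in R, assuming the EIA file has already been read into a data frame `energy`; the column name below is hypothetical, and the actual file uses its own labels:

```r
# 1 BTU = 1055.05585 J and 1 GWh = 3.6e12 J,
# so one quadrillion BTU is roughly 293,071 GWh
quad_to_gwh <- 1.05505585e18 / 3.6e12

# hypothetical column name for solar generation in quadrillion BTU
solar_gwh <- energy$Solar.Energy * quad_to_gwh

# monthly time series starting January 1973 (early values may be zero or missing)
solar_ts <- ts(solar_gwh, start = c(1973, 1), frequency = 12)
```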

My goal is to fit an ARIMA time series model and make 12-month and 5-year forecasts. This isn't a rigorous study backed by extensive domain knowledge; that would certainly help, but for now this is an exercise in building an ARIMA model. Before we dive into model building, here is a plot of all the energy sources, including solar, for some context.

If we look carefully, we can see that solar is one of the smallest sources, shown as a line at the bottom right of the plot. It is dwarfed by the major energy sources: natural gas, coal, and oil. Nevertheless, solar is experiencing exponential growth. We take a closer look at solar and wind, two renewable sources, in the next plot.

Solar and wind energy generation are growing exponentially. Wind begins its exponential growth around the year 2000, and solar begins its boom around 2010. This is great, but how much growth can we expect in 1 year, or in 5 years? Maybe an ARIMA model can give a good guess.

First, we need to transform the data. The growth is exponential, which is hard to work with directly. We can make it easier by applying a logarithmic transformation: on the log scale, changes correspond to relative (percentage) changes, so we are effectively modeling the growth rate rather than the raw generation. Below is a plot of the log-transformed solar generation:
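If the series is stored in a `ts` object like `solar_ts` above, the transformation is a one-liner (a sketch; the log requires strictly positive values, so the series should start where solar generation is non-zero):

```r
# log of monthly solar generation, in log(GWh)
log_solar <- log(solar_ts)
plot(log_solar, ylab = "log(GWh)", main = "Log of U.S. solar generation")
```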

The result is a nice, roughly linear trend that oscillates with the seasons. We can’t model this with ARIMA yet, because it still has a trend; the ARMA part of the model assumes a “stationary” process (one whose mean and variance do not change over time). Fear not: we can remove the trend with another transformation, differencing, in which the previous value is subtracted from each observation. This is reversible so long as we know the first observation. Below is the differenced series:
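In R, differencing the log series is again a single line (a sketch, continuing from the objects above):

```r
# month-over-month change in log generation, i.e. the monthly growth rate
dlog_solar <- diff(log_solar)
plot(dlog_solar, ylab = "diff(log(GWh))", main = "Differenced log of solar generation")
```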

Now it looks like a stationary process, which is what we need for fitting an ARIMA model. To begin, we need to come up with candidate parameters. We can find those by looking at autocorrelation (ACF) and partial autocorrelation (PACF) plots. The parameters in question are the autoregressive (AR), moving average (MA), and seasonal orders. For details, see Time Series Analysis and Its Applications by Robert H. Shumway and David S. Stoffer.
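The paired ACF/PACF plot below can be produced with the `astsa` package, the companion to Shumway and Stoffer's book (a sketch, continuing from the differenced series above):

```r
library(astsa)

# sample ACF and PACF of the differenced log series, out to 4 seasonal cycles
acf2(dlog_solar, max.lag = 48)
```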

In the ACF plot above, we see strong peaks at lags of 1, 2, 3, … years that decay slowly. This may indicate some seasonal non-stationarity, which could be fixed with seasonal differencing; we will consider this in subsequent models.
In the PACF plot, we see strong autocorrelation at the 12-month lag (1.0 on the plot, since lags are in years), after which it cuts off, suggesting a seasonal autoregressive order of P = 1 may be helpful. Below, we fit a seasonal ARIMA model with P = 1, seasonal differencing D = 0, and non-seasonal differencing d = 1:
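A sketch of that fit with `astsa::sarima()`, assuming the remaining orders (p, q, Q) are zero; `sarima()` also draws the diagnostic panel discussed next:

```r
# ARIMA(0,1,0) x (1,0,0)_12 fit to the log series (d = 1 handles the differencing)
fit_seasonal_ar <- sarima(log_solar, p = 0, d = 1, q = 0,
                          P = 1, D = 0, Q = 0, S = 12)
```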

The diagnostic plots are encouraging. The residual time plot shows no obvious patterns, with no outliers greater than 3 standard deviations from zero. The residual distribution in the Q-Q plot is close to normal, apart from somewhat heavy tails. There are some concerns, though. The residual ACF shows autocorrelation at lag 12, which suggests a seasonal moving average component of order Q = 1 may be helpful, and the residual PACF also has a peak at lag 12, so we may consider a higher seasonal autoregressive order such as P = 2. Finally, the p-values of the Ljung-Box statistics indicate some departure from the assumption of independent residuals.

Next, we will compare a series of models using information criteria (a sketch of how this comparison could be run appears after the list):
Fit 3: ARIMA\((0,1,0) \times (1,0,1)_{12}\)
Fit 4: ARIMA\((0,1,0) \times (2,0,1)_{12}\)
Fit 5: ARIMA\((0,1,0) \times (1,1,1)_{12}\)
Fit 6: ARIMA\((0,1,0) \times (2,1,1)_{12}\)
Fit 7: ARIMA\((1,1,0) \times (1,1,1)_{12}\)
Fit 8: ARIMA\((1,1,1) \times (1,1,1)_{12}\)
Fit 9: ARIMA\((0,1,1) \times (1,1,1)_{12}\)
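One way to run this comparison, assuming the `astsa` package (the AIC values are those reported by `sarima()`):

```r
# candidate (p,d,q, P,D,Q) specifications, all with seasonal period 12
specs <- list(
  c(0, 1, 0, 1, 0, 1),  # fit 3
  c(0, 1, 0, 2, 0, 1),  # fit 4
  c(0, 1, 0, 1, 1, 1),  # fit 5
  c(0, 1, 0, 2, 1, 1),  # fit 6
  c(1, 1, 0, 1, 1, 1),  # fit 7
  c(1, 1, 1, 1, 1, 1),  # fit 8
  c(0, 1, 1, 1, 1, 1)   # fit 9
)

# fit each model to the log series and collect its AIC
# (details = FALSE quiets the per-fit optimization output)
aics <- sapply(specs, function(s) {
  sarima(log_solar, s[1], s[2], s[3], P = s[4], D = s[5], Q = s[6],
         S = 12, details = FALSE)$AIC
})

dotchart(aics, labels = paste("fit", 3:9), xlab = "AIC")
```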

We see in the dot chart above that fit 9, ARIMA\((0,1,1) \times (1,1,1)_{12}\), has the smallest AIC value. Below is a plot of its diagnostics:
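A sketch of the call that produces that diagnostic panel, again assuming `astsa`:

```r
# refit the chosen model; sarima() draws the residual diagnostics by default
fit9 <- sarima(log_solar, p = 0, d = 1, q = 1, P = 1, D = 1, Q = 1, S = 12)
```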

The diagnostic plots look reasonable for this model, with the exception of two outliers among the standardized residuals. With that in mind, we will move forward with this model, since it looks good enough for our purposes. As a check, below is a 12-month forecast of the mean, using the last 12 months of data as a hold-out set.
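A sketch of that hold-out check, continuing with `astsa` (`sarima.for()` plots the forecast with error bands):

```r
# drop the last 12 months and refit on the shortened series
n <- length(log_solar)
train <- ts(log_solar[1:(n - 12)], start = start(log_solar), frequency = 12)

# 12-month-ahead forecast from the training series
holdout_fc <- sarima.for(train, n.ahead = 12,
                         p = 0, d = 1, q = 1, P = 1, D = 1, Q = 1, S = 12)

# compare forecast means with the held-out observations, back on the GWh scale
data.frame(forecast_gwh = as.numeric(exp(holdout_fc$pred)),
           actual_gwh   = exp(log_solar[(n - 11):n]))
```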

The model predicts the holdout period pretty well!

Now we can extrapolate into the actual future and forecast solar generation 5 years out from June 2017:

Small note: I’m using 89% prediction intervals for the forecasts. Why 89%? I’ve taken the idea from Prof. Richard McElreath, to underline how arbitrary the choice is. 89 is a prime number, so that may as well be the reason. The point of these intervals is to quantify our uncertainty about where the future values will be, in the context of our model.
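Concretely, here is a minimal sketch of how the 5-year forecast and the 89% bands could be computed (`sarima.for()` draws its own standard-error bands; the 89% limits below are built from the returned standard errors, assuming approximately normal forecast errors on the log scale):

```r
# 60-month-ahead forecast of the log series from the chosen model
fc <- sarima.for(log_solar, n.ahead = 60,
                 p = 0, d = 1, q = 1, P = 1, D = 1, Q = 1, S = 12)

# an 89% interval leaves 5.5% in each tail of the normal distribution
z89 <- qnorm(0.945)  # ~1.60

# back-transform from the log scale to gigawatt-hours
forecast_gwh <- exp(fc$pred)
lower_gwh    <- exp(fc$pred - z89 * fc$se)
upper_gwh    <- exp(fc$pred + z89 * fc$se)
```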

As we see above, our ARIMA model seems to provide a believable forecast for the growth of solar energy generation: it predicts a continuation of exponential growth. This should be taken cautiously, because solar energy will likely experience changes in growth as our energy infrastructure shifts away from fossil fuels. In the near term, however, I hope the model is accurate.

By June 2022, the model predicts with 89% confidence that solar generation will be between 78,341.97 and 203,573.46 gigawatt-hours, a 200 to 687 percent increase over the June 2017 level. One year out from the end of the data, in June 2018, the model predicts between 29,448.10 and 42,313.16 gigawatt-hours, a 13.8 to 63.6 percent increase from a year before.
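Those percentages are just ratios against the last observed month. A sketch of the arithmetic, continuing from the objects above and assuming `log_solar` ends in June 2017:

```r
# last observed month (June 2017), back on the GWh scale
last_obs <- exp(as.numeric(tail(log_solar, 1)))

# percent change implied by the 89% limits, 5 years and 1 year out
(c(lower_gwh[60], upper_gwh[60]) / last_obs - 1) * 100
(c(lower_gwh[12], upper_gwh[12]) / last_obs - 1) * 100
```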

Thanks for reading, and thanks to Prof. Kerr of CSU East Bay for introducing me to this stuff.