COVID19’s effect on airport traffic: A Data Analysis in R
Download the data from here.
Load your data (csv format)
data = read.csv("location/filename.csv")
Exploration of Data
Observing the data
head(data)
tail(data)
Variable Types
str(data)
Understanding the Data
Important Variables
Numerical Variables
- PercentOfBaseline
Categorical Variables
- Date
- AirportName
- Centroid (of airports)
- City
- State
- Country
- Geography (of countries)
Comments
- This is time-series data.
- The centroid and geography will be only helpful if we want to visualize the data on the world map.
- Country → State → City → AirportName are factor variables inside the next one.
Focus
- Data Visualization is the main focus of the project.
- Due to the structure of Country → State → City → AirportName, the best way to visualize it in an app with dropdown options. (R Shiny App, Tableau, Power BI, etc.)
- Analyze each of the individual time series and forecast for the next “n” number of days.
- Data visualization of categorical variables and their relationship is not important, because they are interrelated and give no valuable insight.
Exploratory Data Analysis
Univariate (Kernel Density of PercentOfBaseline)
library(ggplot2)
ggplot(data, aes(x = PercentOfBaseline, fill = "red"))+geom_density(alpha = 0.2)
Exercise: Plot individual kernel density of each factor of Country, State, City, and AirportName.
Bivariate
Numerical vs Categorical
- PercentofBaseline vs Country
ggplot(data, aes(x = PercentOfBaseline, fill = Country))+geom_density(alpha = 0.2)
Observations:
- All the countries are behaving like normal mixtures with multiple modes.
- Australia has two distinct modes of normal mixtures. (Why?)
- Canada has a higher PercentOfBaseline than most of the other countries.
Exercise: Plot and compare kernel density of PercentOfBaseline of factors of Country, State, City, and AirportName.
- PercentofBaseline vs Date (Time Series)
ggplot(data, aes(x =Date, y = PercentOfBaseline, group = Country, colour = Country)) + geom_line() +facet_wrap(~ Country)
Observations:
- Except for Australia, the other countries have an increasing trend.
- Australia has two distinct modes of normal mixtures because, over time, the PercentOfBaseline has a major shift.
Exercise: Plot and compare the time series of PercentofBaseline of Country, State, City, and AirportName.
Time Series Forecast
Time Series is a specific data structure in R.
We have to convert data into time series data structures to apply time series algorithms.
Chile Data
library(dplyr)
data_chile = data %>% filter(Country == "Chile")
chile = data_chile[,c(2,5)]
head(chile)
Convert into time-series data structure*
- 1,2,3,4,… as time index with no NA values.
library(zoo)
z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for dates
time(z) <- seq_along(time(z))#sequential data as 1,2,3,4,...
ts_chile = as.ts(z) #conversion into time series data structure
head(ts_chile)
- Date as a time index with NA values to be removed.
library(zoo)
library(tseries)#for removing na from ts()
z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for dates
ts_chile = as.ts(z)
ts_chile = na.remove(ts_chile)#remove na from time series
head(ts_chile)
If you convert here, like this it will create weird numbers. I prefer to do it as 1,2,3,4,… time index.
Plot the time series
plot(ts_chile)
Two parts of the Time Series, we’ll discuss are
- Decomposition (Trend + Seasonal + Error)
- Forecasting
Decomposition
decompose(ts_chile)
This is happening because the data has no seasonal trend
Forecast
Model (ARIMA model)
library(forecast)
model = auto.arima(ts_chile)
model
ACF Plot
acf(model$residuals, main = "Correlogram")
PACF Plot
pacf(model$residuals, main = "Partial Correlogram")
Ljung Box Test
The test determines whether or not errors are iid (i.e. white noise) or whether there is something more behind them; whether or not the autocorrelations for the errors or residuals are non zero.
Box.test(model$residuals, type = "Ljung-Box")
We do not reject the hypothesis, that the model shows a good fit.
Normality of Residuals
hist(model$residuals, freq = FALSE)
lines(density(model$residuals))
Forecast
forecast = forecast(model,4)#4 = number of units you want to
library(ggplot2)
autoplot(forecast) #plot of the model
accuracy(forecast) #performance of the model
forecast #forecast values
Exercise: Apply it to other countries as well.
All the best. :D