COVID19’s effect on airport traffic: A Data Analysis in R

Published in

Srijit Mukherjee Content

4 min readMay 21, 2021

Download the data from here.

Load your data (csv format)

data = read.csv("location/filename.csv")

Exploration of Data

Observing the data

head(data) 
tail(data)

Variable Types

str(data)

Understanding the Data

Important Variables

Numerical Variables

PercentOfBaseline

Categorical Variables

Date
AirportName
Centroid (of airports)
City
State
Country
Geography (of countries)

Comments

This is time-series data.
The centroid and geography will be only helpful if we want to visualize the data on the world map.
Country → State → City → AirportName are factor variables inside the next one.

Focus

Data Visualization is the main focus of the project.
Due to the structure of Country → State → City → AirportName, the best way to visualize it in an app with dropdown options. (R Shiny App, Tableau, Power BI, etc.)
Analyze each of the individual time series and forecast for the next “n” number of days.
Data visualization of categorical variables and their relationship is not important, because they are interrelated and give no valuable insight.

Exploratory Data Analysis

Univariate (Kernel Density of PercentOfBaseline)

library(ggplot2)
ggplot(data, aes(x = PercentOfBaseline, fill = "red"))+geom_density(alpha = 0.2)

Exercise: Plot individual kernel density of each factor of Country, State, City, and AirportName.

Bivariate

Numerical vs Categorical

- PercentofBaseline vs Country

ggplot(data, aes(x = PercentOfBaseline, fill = Country))+geom_density(alpha = 0.2)

Observations:

All the countries are behaving like normal mixtures with multiple modes.
Australia has two distinct modes of normal mixtures. (Why?)
Canada has a higher PercentOfBaseline than most of the other countries.

Exercise: Plot and compare kernel density of PercentOfBaseline of factors of Country, State, City, and AirportName.

- PercentofBaseline vs Date (Time Series)

ggplot(data, aes(x =Date, y = PercentOfBaseline, group = Country, colour = Country)) + geom_line() +facet_wrap(~ Country)

Observations:

Except for Australia, the other countries have an increasing trend.
Australia has two distinct modes of normal mixtures because, over time, the PercentOfBaseline has a major shift.

Exercise: Plot and compare the time series of PercentofBaseline of Country, State, City, and AirportName.

Time Series Forecast

Time Series is a specific data structure in R.

We have to convert data into time series data structures to apply time series algorithms.

Chile Data

library(dplyr)
data_chile = data %>% filter(Country == "Chile") 
chile = data_chile[,c(2,5)]
head(chile)

Convert into time-series data structure*

1,2,3,4,… as time index with no NA values.

library(zoo)
z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for dates
time(z) <- seq_along(time(z))#sequential data as 1,2,3,4,...
ts_chile = as.ts(z) #conversion into time series data structure
head(ts_chile)

Date as a time index with NA values to be removed.

library(zoo)
library(tseries)#for removing na from ts()
z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for dates
ts_chile = as.ts(z) 
ts_chile = na.remove(ts_chile)#remove na from time series
head(ts_chile)

If you convert here, like this it will create weird numbers. I prefer to do it as 1,2,3,4,… time index.

Plot the time series

plot(ts_chile)

Two parts of the Time Series, we’ll discuss are

Decomposition (Trend + Seasonal + Error)
Forecasting

Decomposition

decompose(ts_chile)

This is happening because the data has no seasonal trend

Forecast

Model (ARIMA model)

library(forecast)
model = auto.arima(ts_chile)
model

ACF Plot

acf(model$residuals, main = "Correlogram")

PACF Plot

pacf(model$residuals, main = "Partial Correlogram")

Ljung Box Test

The test determines whether or not errors are iid (i.e. white noise) or whether there is something more behind them; whether or not the autocorrelations for the errors or residuals are non zero.

Box.test(model$residuals, type = "Ljung-Box")

We do not reject the hypothesis, that the model shows a good fit.

Normality of Residuals

hist(model$residuals, freq = FALSE)
lines(density(model$residuals))

Forecast

forecast = forecast(model,4)#4 = number of units you want to 
library(ggplot2) 
autoplot(forecast) #plot of the model

accuracy(forecast) #performance of the model

forecast #forecast values

Exercise: Apply it to other countries as well.

All the best. :D

COVID19’s effect on airport traffic: A Data Analysis in R

Load your data (csv format)

Exploration of Data

Observing the data

Variable Types

Understanding the Data

Important Variables

Numerical Variables

Categorical Variables

Comments

Focus

Exploratory Data Analysis

Univariate (Kernel Density of PercentOfBaseline)

Bivariate

Numerical vs Categorical

- PercentofBaseline vs Country

Observations:

- PercentofBaseline vs Date (Time Series)

Observations:

Time Series Forecast

Chile Data

Convert into time-series data structure*

Plot the time series

Decomposition

Forecast

Model (ARIMA model)

ACF Plot

PACF Plot

Ljung Box Test

Normality of Residuals

Forecast

Written by Srijit Mukherjee