COVID19’s effect on airport traffic: A Data Analysis in R

Srijit Mukherjee
Srijit Mukherjee Content
4 min readMay 21, 2021

--

Download the data from here.

Load your data (csv format)

data = read.csv("location/filename.csv")

Exploration of Data

Observing the data

head(data) 
tail(data)

Variable Types

str(data)

Understanding the Data

Important Variables

Numerical Variables

  • PercentOfBaseline

Categorical Variables

  • Date
  • AirportName
  • Centroid (of airports)
  • City
  • State
  • Country
  • Geography (of countries)

Comments

  • This is time-series data.
  • The centroid and geography will be only helpful if we want to visualize the data on the world map.
  • Country → State → City → AirportName are factor variables inside the next one.

Focus

  • Data Visualization is the main focus of the project.
  • Due to the structure of Country → State → City → AirportName, the best way to visualize it in an app with dropdown options. (R Shiny App, Tableau, Power BI, etc.)
  • Analyze each of the individual time series and forecast for the next “n” number of days.
  • Data visualization of categorical variables and their relationship is not important, because they are interrelated and give no valuable insight.

Exploratory Data Analysis

Univariate (Kernel Density of PercentOfBaseline)

library(ggplot2)
ggplot(data, aes(x = PercentOfBaseline, fill = "red"))+geom_density(alpha = 0.2)

Exercise: Plot individual kernel density of each factor of Country, State, City, and AirportName.

Bivariate

Numerical vs Categorical

- PercentofBaseline vs Country

ggplot(data, aes(x = PercentOfBaseline, fill = Country))+geom_density(alpha = 0.2)

Observations:

  • All the countries are behaving like normal mixtures with multiple modes.
  • Australia has two distinct modes of normal mixtures. (Why?)
  • Canada has a higher PercentOfBaseline than most of the other countries.

Exercise: Plot and compare kernel density of PercentOfBaseline of factors of Country, State, City, and AirportName.

- PercentofBaseline vs Date (Time Series)

ggplot(data, aes(x =Date, y = PercentOfBaseline, group = Country, colour = Country)) + geom_line() +facet_wrap(~ Country)

Observations:

  • Except for Australia, the other countries have an increasing trend.
  • Australia has two distinct modes of normal mixtures because, over time, the PercentOfBaseline has a major shift.

Exercise: Plot and compare the time series of PercentofBaseline of Country, State, City, and AirportName.

Time Series Forecast

Time Series is a specific data structure in R.

We have to convert data into time series data structures to apply time series algorithms.

Chile Data

library(dplyr)
data_chile = data %>% filter(Country == "Chile")
chile = data_chile[,c(2,5)]
head(chile)

Convert into time-series data structure*

  • 1,2,3,4,… as time index with no NA values.
library(zoo)
z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for dates
time(z) <- seq_along(time(z))#sequential data as 1,2,3,4,...
ts_chile = as.ts(z) #conversion into time series data structure
head(ts_chile)
  • Date as a time index with NA values to be removed.
library(zoo)
library(tseries)#for removing na from ts()
z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for dates
ts_chile = as.ts(z)
ts_chile = na.remove(ts_chile)#remove na from time series
head(ts_chile)

If you convert here, like this it will create weird numbers. I prefer to do it as 1,2,3,4,… time index.

Plot the time series

plot(ts_chile)

Two parts of the Time Series, we’ll discuss are

  • Decomposition (Trend + Seasonal + Error)
  • Forecasting

Decomposition

decompose(ts_chile)

This is happening because the data has no seasonal trend

Forecast

Model (ARIMA model)

library(forecast)
model = auto.arima(ts_chile)
model

ACF Plot

acf(model$residuals, main = "Correlogram")

PACF Plot

pacf(model$residuals, main = "Partial Correlogram")

Ljung Box Test

The test determines whether or not errors are iid (i.e. white noise) or whether there is something more behind them; whether or not the autocorrelations for the errors or residuals are non zero.

Box.test(model$residuals, type = "Ljung-Box")

We do not reject the hypothesis, that the model shows a good fit.

Normality of Residuals

hist(model$residuals, freq = FALSE)
lines(density(model$residuals))

Forecast

forecast = forecast(model,4)#4 = number of units you want to 
library(ggplot2)
autoplot(forecast) #plot of the model
accuracy(forecast) #performance of the model
forecast #forecast values

Exercise: Apply it to other countries as well.

All the best. :D

--

--

Srijit Mukherjee
Srijit Mukherjee Content

I can help you carve a career in statistics, data science, and machine learning. Know me more at https://www.linkedin.com/in/srijit-mukherjee/.