This post is about analysing 20 million rows of flight on-time/delay performance using hadoop and hive.
Knowledge required: SQL and basic hadoop(hdfs, hive) concepts Tools required: Any host system(I’m using windows7), cloudera’s CDH 5.4.0 quickstart virtual instance
Introduction Get the flight data Load data in HDFS create table structure describe table load data show data in table Bucketing and sampling Running Ad-Hoc queries Graphs and visualizations
The Airline On-Time Performance Data, “contains on-time arrival data for non-stop domestic flights by major air carriers, and provides such additional items as departure and arrival delays, origin and destination airports, flight numbers, scheduled and actual departure and arrival times, cancelled or diverted flights, taxi-out and taxi-in times, air time, and non-stop distance.”
- This dataset can be used to work on cool travel ideas like:
- When is the best time of day/day of week/time of year to fly to minimise delays?
- Do older planes suffer more delays?
- How does the number of people flying between different locations change over time?
- How well does weather predict plane delays?
- Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
Get the flight data
we can easily get the data from here. The data is provided in the form of .csv files from year 1987 to 2008. It’s a huge dataset(2 decades old) which contains around 120 million rows of flight details and sums up to about 12GB when uncompressed. Following is the detailed description of all the variables used in the dataset.
Some important variables: Time Related month quarter day of month Geography origin destination Operations delays arrival departure Flight Data carrier code tail number flight number
Load data in HDFS
Once we get the data, our next task is to load it in HDFS for further analysis. Currently the data is in the host OS’s file system. In order to load the data in HDFS we have to move the data from host OS to guest OS first and then from guest OS’s file-system to HDFS. To transer data from host to guest OS, we can simply create a shared folder in host OS and access it from guest OS. Now tranfer the data files from guest OS to HDFS using command line and hadoop fs commands.
Another easier way is upload it using HUE web GUI(from host OS).
Create table structure in hive
Now we will create a schema for the dataset and then load all the .csv files in this newly created table. In the background we are simply creating a logical table structure over the HDFS dataset using Hive. Note that at this point, only the logical structure is defined, data is still not loaded in the table. This property is called Schema on read. “Schema-on-read” means you do not need to know how you will use your data when you are storing it. This allows you to innovate quickly by asking different and more powerful questions after storing the data.
Create External Table
Here we’re creating an external hive table, so that the same dataset can be used for other operations(pig, Hbase) as well.
Once the table structure is created we can easily view it using Hive describe clause.
1 describe airlines;
Load data in Table
Next step is to load the data in the hive table that we created.
Check if data is loaded
Run a simple select/count query to check if the data is loaded correctly in the table.
Bucketing and data sampling
We can also bucket the data using the following query or can further use these buckets to create sample data for test anaysis.
1 2 3 4 5 // get count of each bucket select count(*) from airlines_bucketed tablesample(bucket 1 out of 22 on year); select count(*) from airlines_bucketed tablesample(bucket 2 out of 22 on year); select count(*) from airlines_bucketed tablesample(bucket 3 out of 22 on year); select count(*) from airlines_bucketed tablesample(bucket 5 out of 11 on year);
1 2 3 4 // fetching data from different buckets select distinct Year from airlines_bucketed tablesample(bucket 1 out of 22 on year); select distinct year from airlines_bucketed tablesample(bucket 2 out of 22 on year); select distinct year from airlines_bucketed tablesample(bucket 5 out of 10 on year);
1 2 3 4 SELECT distinct origin FROM airlines_bucketed tablesample(bucket 1 out of 22 on year) WHERE dest='sfo' ORDER BY day;
1 2 3 // block sampling select count(*) from airlines_bucketed tablesample(10 percent); (output count based on the number of blocks)
1 2 3 4 5 6 7 8 9 10 11 12 13 select min(year), max(year), uniqueCarrier, count(*) as cnt, sum(if(ArrDelay>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelay>30, 1, 0))/count(*),2) as rate FROM airlines WHERE DayOfWeek not in (6,7) and Origin not in ('AK', 'HI', 'PR', 'VI') and Dest not in ('AK', 'HI', 'PR', 'VI') and year < 2010 GROUP by uniquecarrier HAVING cnt > 100000 and max(year) > 1990 ORDER by rate DESC LIMIT 1000;
Running ad-hoc queries using Hive
We can use Hive to answer “ad-hoc” queries at the same cost as that for generating aggregate reports. For instance: What was the worst year due to delays caused by weather in the state of Oregon? Answer is 2004, with 383 delayed flights due to weather causes
We can use this hive query to get the result: // todo
Some simple Ad-hoc queries
Calculating average departure delay for all flights in minutes: 9.58
1 2 3 select avg(DepDelay) from airlines; query result: 8.17
Calculating departure delay standard deviation for all flights in minutes: 27.98
SD is a measure that is used to quantify the amount of variation or dispersion of a set of data values from its mean.
1 2 3 select stddev_pop(DepDelay) from airlines; query result: 28.50
Count of flights with ‘SFO’ as source
1 2 3 4 select sum(f.flights) flightsfrom flight_performance f join origin o on (f.origin = o.origin) where o.origin = 'SFO';
1 2 select year,sum(*) as TotalFlights from airlines where origin = 'SFO' group by year order by year;
1 2 3 // counts the flights and max delay at each airport select airport_cd, count(*), max(delay) from airlines group by airport_cd;
average arrival delay in minutes for each U.S. certified carrier
Delayed flights performance by state: Which states have had more relative delayed flights during this time? (For convenience, we will consider a delay each flight whose departure delay in minutes is greater than two times the average: 19.16 minutes).
1 2 3 select OriginState, count(if(DepDelay > 19.16, "", NULL)), count(*), count(if(DepDelayMinutes > 19.16, "", NULL)) / count(*) as pcent_retraso from airlines group by OriginState order by pcent_retraso desc;
ResultSet : State A: Delayed flights B: Flights A / B IL 1271765 7446363 0.17 NJ 413214 2567503 0.16 GA 934228 6012156 0.15
Delayed flights annual performance by carrier: Which flight carriers have had more relative delayed flights during this time, by year? This query shows us a “negative annual ranking” of carriers.
1 2 create table year_carrier_stats(year string, carrier string, delayed int, total int, percent double) row format delimited fields terminated by '\t' stored as textfile;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 insert overwrite table year_carrier_stats select Year, Carrier, count(if(DepDelayMinutes > 19.16, "", NULL)), count(*), count(if(DepDelayMinutes > 19.16, "", NULL)) / count(*) as pcent_retraso from my_table group by Year, Carrier order by Year asc, pcent_retraso desc;
1 select year,count(*) as TotalFlights from airlines where origin = 'SFO' group by year order by year;
ResultSet: Year Carrier A: Delayed flights B: Flights A / B 2007 EV 78482 286234 0.27 2007 AA 134338 633857 0.21 2007 MQ 110457 540494 0.20 2007 B6 38695 191450 0.20 2007 UA 97736 490002 0.19
1 2 3 4 5 6 7 8 9 10 11 12 select Year, count(if(WeatherDelay > 0, "", NULL)) as foo from my_table where OriginState like '%OR%' group by Year order by foo desc limit 1;
Let’s take the following hypothesis: The later in the day, the more likely it is that a flight will be delayed. Is there any relationship between arrival time and delay time of a flight?
A simple way of proving this hypothesis is through a simple query that extracts and creates groupings by arrival time, measuring the average delay time in each of the time intervals.
1 2 3 4 5 6 7 8 9 select substr(DepTime, 0, 2) as foo, avg(ArrDelayMinutes) from my_table group by substr(DepTime, 0, 2) order by foo asc;
Indeed, the later in the day, the more delays there are, because flight delays are usually accumulative. Because flights that arrive late in the night still arrive after midnight, the early hours of the day are the ones that accumulate the most delay time.
Creating visualizations and graphs
Other questions that could be asked the same way and at the same cost: What days in the week accumulated the majority of delayed flights? Which flights are delayed the most, by company?
ToDO delays by year delay by month or week delay on hourly basis of a day busiest airports top cities When is the best time of day/day of week/time of year to fly to minimise delays? Do older planes suffer more delays? How does the number of people flying between different locations change over time? How well does weather predict plane delays? Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?