OnTime
航班飞行数据有以下两个方式获取:
- 从原始数据导入
- 下载预处理好的数据
从原始数据导入
下载数据:
wget --no-check-certificate --continue https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip
创建表结构:
CREATE TABLE `ontime`(`Year` UInt16,`Quarter` UInt8,`Month` UInt8,`DayofMonth` UInt8,`DayOfWeek` UInt8,`FlightDate` Date,`Reporting_Airline` String,`DOT_ID_Reporting_Airline` Int32,`IATA_CODE_Reporting_Airline` String,`Tail_Number` String,`Flight_Number_Reporting_Airline` String,`OriginAirportID` Int32,`OriginAirportSeqID` Int32,`OriginCityMarketID` Int32,`Origin` FixedString(5),`OriginCityName` String,`OriginState` FixedString(2),`OriginStateFips` String,`OriginStateName` String,`OriginWac` Int32,`DestAirportID` Int32,`DestAirportSeqID` Int32,`DestCityMarketID` Int32,`Dest` FixedString(5),`DestCityName` String,`DestState` FixedString(2),`DestStateFips` String,`DestStateName` String,`DestWac` Int32,`CRSDepTime` Int32,`DepTime` Int32,`DepDelay` Int32,`DepDelayMinutes` Int32,`DepDel15` Int32,`DepartureDelayGroups` String,`DepTimeBlk` String,`TaxiOut` Int32,`WheelsOff` Int32,`WheelsOn` Int32,`TaxiIn` Int32,`CRSArrTime` Int32,`ArrTime` Int32,`ArrDelay` Int32,`ArrDelayMinutes` Int32,`ArrDel15` Int32,`ArrivalDelayGroups` Int32,`ArrTimeBlk` String,`Cancelled` UInt8,`CancellationCode` FixedString(1),`Diverted` UInt8,`CRSElapsedTime` Int32,`ActualElapsedTime` Int32,`AirTime` Nullable(Int32),`Flights` Int32,`Distance` Int32,`DistanceGroup` UInt8,`CarrierDelay` Int32,`WeatherDelay` Int32,`NASDelay` Int32,`SecurityDelay` Int32,`LateAircraftDelay` Int32,`FirstDepTime` String,`TotalAddGTime` String,`LongestAddGTime` String,`DivAirportLandings` String,`DivReachedDest` String,`DivActualElapsedTime` String,`DivArrDelay` String,`DivDistance` String,`Div1Airport` String,`Div1AirportID` Int32,`Div1AirportSeqID` Int32,`Div1WheelsOn` String,`Div1TotalGTime` String,`Div1LongestGTime` String,`Div1WheelsOff` String,`Div1TailNum` String,`Div2Airport` String,`Div2AirportID` Int32,`Div2AirportSeqID` Int32,`Div2WheelsOn` String,`Div2TotalGTime` String,`Div2LongestGTime` String,`Div2WheelsOff` String,`Div2TailNum` String,`Div3Airport` String,`Div3AirportID` Int32,`Div3AirportSeqID` Int32,`Div3WheelsOn` String,`Div3TotalGTime` String,`Div3LongestGTime` String,`Div3WheelsOff` String,`Div3TailNum` String,`Div4Airport` String,`Div4AirportID` Int32,`Div4AirportSeqID` Int32,`Div4WheelsOn` String,`Div4TotalGTime` String,`Div4LongestGTime` String,`Div4WheelsOff` String,`Div4TailNum` String,`Div5Airport` String,`Div5AirportID` Int32,`Div5AirportSeqID` Int32,`Div5WheelsOn` String,`Div5TotalGTime` String,`Div5LongestGTime` String,`Div5WheelsOff` String,`Div5TailNum` String) ENGINE = MergeTreePARTITION BY YearORDER BY (IATA_CODE_Reporting_Airline, FlightDate)SETTINGS index_granularity = 8192;
加载数据:
ls -1 *.zip | xargs -I{} -P $(nproc) bash -c "echo {}; unzip -cq {} '*.csv' | sed 's/\.00//g' | clickhouse-client --input_format_with_names_use_header=0 --query='INSERT INTO ontime FORMAT CSVWithNames'"
下载预处理好的分区数据
$ curl -O https://datasets.clickhouse.com/ontime/partitions/ontime.tar$ tar xvf ontime.tar -C /var/lib/clickhouse # path to ClickHouse data directory$ # check permissions of unpacked data, fix if required$ sudo service clickhouse-server restart$ clickhouse-client --query "select count(*) from datasets.ontime"
!!! info “信息”
如果要运行下面的SQL查询,必须使用完整的表名,datasets.ontime。
查询:
Q0.
SELECT avg(c1)FROM(SELECT Year, Month, count(*) AS c1FROM ontimeGROUP BY Year, Month);
Q1. 查询从2000年到2008年每天的航班数
SELECT DayOfWeek, count(*) AS cFROM ontimeWHERE Year>=2000 AND Year<=2008GROUP BY DayOfWeekORDER BY c DESC;
Q2. 查询从2000年到2008年每周延误超过10分钟的航班数。
SELECT DayOfWeek, count(*) AS cFROM ontimeWHERE DepDelay>10 AND Year>=2000 AND Year<=2008GROUP BY DayOfWeekORDER BY c DESC;
Q3. 查询2000年到2008年每个机场延误超过10分钟以上的次数
SELECT Origin, count(*) AS cFROM ontimeWHERE DepDelay>10 AND Year>=2000 AND Year<=2008GROUP BY OriginORDER BY c DESCLIMIT 10;
Q4. 查询2007年各航空公司延误超过10分钟以上的次数
SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*)FROM ontimeWHERE DepDelay>10 AND Year=2007GROUP BY CarrierORDER BY count(*) DESC;
Q5. 查询2007年各航空公司延误超过10分钟以上的百分比
SELECT Carrier, c, c2, c*100/c2 as c3FROM(SELECTIATA_CODE_Reporting_Airline AS Carrier,count(*) AS cFROM ontimeWHERE DepDelay>10AND Year=2007GROUP BY Carrier) qJOIN(SELECTIATA_CODE_Reporting_Airline AS Carrier,count(*) AS c2FROM ontimeWHERE Year=2007GROUP BY Carrier) qq USING CarrierORDER BY c3 DESC;
更好的查询版本:
SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay>10)*100 AS c3FROM ontimeWHERE Year=2007GROUP BY CarrierORDER BY c3 DESC
Q6. 同上一个查询一致,只是查询范围扩大到2000年到2008年
SELECT Carrier, c, c2, c*100/c2 as c3FROM(SELECTIATA_CODE_Reporting_Airline AS Carrier,count(*) AS cFROM ontimeWHERE DepDelay>10AND Year>=2000 AND Year<=2008GROUP BY Carrier) qJOIN(SELECTIATA_CODE_Reporting_Airline AS Carrier,count(*) AS c2FROM ontimeWHERE Year>=2000 AND Year<=2008GROUP BY Carrier) qq USING CarrierORDER BY c3 DESC;
更好的查询版本:
SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay>10)*100 AS c3FROM ontimeWHERE Year>=2000 AND Year<=2008GROUP BY CarrierORDER BY c3 DESC;
Q7. 每年航班延误超过10分钟的百分比
SELECT Year, c1/c2FROM(selectYear,count(*)*100 as c1from ontimeWHERE DepDelay>10GROUP BY Year) qJOIN(selectYear,count(*) as c2from ontimeGROUP BY Year) qq USING (Year)ORDER BY Year;
更好的查询版本:
SELECT Year, avg(DepDelay>10)*100FROM ontimeGROUP BY YearORDER BY Year;
Q8. 每年更受人们喜爱的目的地
SELECT DestCityName, uniqExact(OriginCityName) AS uFROM ontimeWHERE Year >= 2000 and Year <= 2010GROUP BY DestCityNameORDER BY u DESC LIMIT 10;
Q9.
SELECT Year, count(*) AS c1FROM ontimeGROUP BY Year;
Q10.
SELECTmin(Year), max(Year), IATA_CODE_Reporting_Airline AS Carrier, count(*) AS cnt,sum(ArrDelayMinutes>30) AS flights_delayed,round(sum(ArrDelayMinutes>30)/count(*),2) AS rateFROM ontimeWHEREDayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')AND FlightDate < '2010-01-01'GROUP by CarrierHAVING cnt>100000 and max(Year)>1990ORDER by rate DESCLIMIT 1000;
Bonus:
SELECT avg(cnt)FROM(SELECT Year,Month,count(*) AS cntFROM ontimeWHERE DepDel15=1GROUP BY Year,Month);SELECT avg(c1) FROM(SELECT Year,Month,count(*) AS c1FROM ontimeGROUP BY Year,Month);SELECT DestCityName, uniqExact(OriginCityName) AS uFROM ontimeGROUP BY DestCityNameORDER BY u DESCLIMIT 10;SELECT OriginCityName, DestCityName, count() AS cFROM ontimeGROUP BY OriginCityName, DestCityNameORDER BY c DESCLIMIT 10;SELECT OriginCityName, count() AS cFROM ontimeGROUP BY OriginCityNameORDER BY c DESCLIMIT 10;
这个性能测试由Vadim Tkachenko提供。参考:
- https://www.percona.com/blog/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/
- https://www.percona.com/blog/2009/10/26/air-traffic-queries-in-luciddb/
- https://www.percona.com/blog/2009/11/02/air-traffic-queries-in-infinidb-early-alpha/
- https://www.percona.com/blog/2014/04/21/using-apache-hadoop-and-impala-together-with-mysql-for-data-analysis/
- https://www.percona.com/blog/2016/01/07/apache-spark-with-air-ontime-performance-data/
- http://nickmakos.blogspot.ru/2012/08/analyzing-air-traffic-performance-with.html
