Vehicle Trip Discovery with GIS Tools for Hadoop
An interesting task in highway management is to study the potential impact of driver carpooling, based on an analysis of automatically collected automobile GPS position data. To identify potential enhancements to carpool participation, we set out to study places that have the highest numbers of trips with similar origin and destination locations. The source data for this experiment consists of nearly 40 million vehicle position records assembled from a single day of GPS-collected vehicle positions. The position data consists of longitude and latitude along with date, time, and speed. We used the Hadoop MapReduce framework for distributed parallel computation, given its capability for analyzing much larger data sets.
The study area was the country of Japan, selected from World Map in the data included with ArcGIS. It was projected to the Tokyo Geographic Coordinate System, with the Project geoprocessing tool in ArcMap, to match the spatial reference of the GPS data.
Next, the study area was exported to JSON format, suitable for use in computations on Hadoop, by using the Features To JSON tool.
The input feature class is the country of Japan in Tokyo GCS, as produced in the previous step. The output JSON parameter is a filename of our choosing, with .json extension. For the remaining parameters, the defaults are suitable: JSON type (ENCLOSED_JSON) and Formatted JSON (unchecked).
Then the JSON file representing the features was copied to HDFS, to be accessible to the computations on Hadoop. This can be done either with command-line hadoop fs (see the example after the parameter list below), or by using the Copy To HDFS tool.
The tool expects the following parameters:
- The Input local file is the JSON file output by the previous step.
- The HDFS parameters for hostname, port, and username - these relate to the configuration of Hadoop; values can be provided by the system administrator (the default port number should work unless the admin has configured a non-default port).
- The HDFS remote file - the path and file name where the tool will copy the file.
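If using the command line rather than the Copy To HDFS tool, standard hadoop fs commands can place the file in HDFS, for example (the HDFS path is illustrative and matches the job recipes later in this article):
hadoop fs -mkdir -p trips
hadoop fs -put japan-tokyo-gcs.json trips/japan-tokyo-gcs.json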
Calculations were done in two stages of MapReduce applications on Hadoop. As a cursory overview of MapReduce: a Mapper emits key-value pairs, the framework groups the values by key, and then for each key a Reducer performs a calculation on that key's list of values and emits any output records for that key. The first MapReduce application used in our study creates a grid of cells covering Japan, infers the origin and destination of trips from sequences of vehicle positions, and identifies the grid cell containing the origin and destination point of each inferred trip.
The columns of the CSV file of GPS positions include vehicle ID, date, time, position as longitude and latitude (degrees-minutes-seconds), and speed (km/h). The mapper of the first-stage MapReduce application reads the input CSV data and treats the combination of car ID and date as the key, with the remainder of the input fields as the value associated with that key. Thus the data is grouped by car and date, passing to the reducer a separate list of position records for each car and date.
To support trips that span midnight - when multiple days of data are available - a mapper could instead use a key consisting of the car ID only, passing to the reducer a potentially longer list of position records, which would need to be sorted by date and time.
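A minimal sketch of the first-stage mapper, keyed by car ID and date, could look like the following (the midnight-spanning variant would key by car ID only; the class name and column positions are illustrative assumptions, not the exact sample code):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: group GPS position records by (car ID, date) so the reducer receives
// one list of positions per car per day. Column order is assumed for illustration.
public class TripCellMapperSketch extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",", 3);  // carID, date, rest of the record
    if (fields.length < 3) return;                    // skip malformed records
    // Key: carID + date; value: the remaining fields (time, position, speed)
    context.write(new Text(fields[0] + "," + fields[1]), new Text(fields[2]));
  }
}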
The grid of cells to cover Japan is calculated in the setup of the reducer of the first MapReduce application - once per reducer node. If we had simply used an equal-angle grid with the geographic coordinates, the geodesic area of a cell would have differed by about 18% between northernmost and southernmost Japan - so we generate an equal-area grid, while still using geographic coordinates.
To produce an equal-area grid, the code takes the envelope of the study area and uses the middle latitude as a baseline. It uses GeometryEngine.geodesicDistanceOnWGS84 to calculate the length of a 1° arc along that latitude (using WGS84 to make a close approximation of distance - not position - on Tokyo GCS), then uses proportions to find the X-axis angle corresponding to the desired cell width. With a constant X-axis angle, it calculates variable Y-axis grid angles, starting at the southernmost latitude and working northward: it applies GeometryEngine.geodesicDistanceOnWGS84 to the X-axis cell-width angle, and divides the result into the constant cell area to get the Y-axis angle for each row of the grid.
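The following is a simplified sketch of that construction using the Esri geometry API; the class and method names, the target cell size parameter, and the use of the envelope's west edge as the measurement baseline are illustrative assumptions:
import com.esri.core.geometry.Envelope;
import com.esri.core.geometry.GeometryEngine;
import com.esri.core.geometry.Point;
import java.util.ArrayList;
import java.util.List;

// Sketch: compute a constant X-axis angle and per-row Y-axis angles so that every
// grid cell has approximately the same geodesic area. Names are illustrative.
class EqualAreaGridSketch {
  static List<Double> rowAngles(Envelope env, double cellSizeMeters) {
    double lonBase = env.getXMin();
    double midLat = (env.getYMin() + env.getYMax()) / 2;
    // Length in meters of a 1-degree arc along the middle latitude
    double oneDegXMeters = GeometryEngine.geodesicDistanceOnWGS84(
        new Point(lonBase, midLat), new Point(lonBase + 1, midLat));
    double xAngle = cellSizeMeters / oneDegXMeters;   // constant X-axis cell angle
    double cellArea = cellSizeMeters * cellSizeMeters;

    List<Double> yAngles = new ArrayList<>();         // one Y-axis angle per grid row
    double lat = env.getYMin();                       // start at the southernmost latitude
    while (lat < env.getYMax()) {
      // Actual width in meters of this row's cells at latitude lat
      double rowWidthMeters = GeometryEngine.geodesicDistanceOnWGS84(
          new Point(lonBase, lat), new Point(lonBase + xAngle, lat));
      // Length in meters of a 1-degree arc of latitude, to convert meters to degrees
      double oneDegYMeters = GeometryEngine.geodesicDistanceOnWGS84(
          new Point(lonBase, lat), new Point(lonBase, lat + 1));
      // Divide the constant cell area by this row's width to keep the area constant
      double yAngle = (cellArea / rowWidthMeters) / oneDegYMeters;
      yAngles.add(yAngle);
      lat += yAngle;
    }
    return yAngles;
  }
}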
Finding the cell containing a location point, when the cells are stored as an array of bounds and each row has the same number of cells, proceeds as follows. The X-axis index is a straightforward division by cell width. The Y-axis calculation is adjusted for possible overshoot or undershoot, due to varying cell height. Then cellIndex = xIndex + xCount * yIndex.
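A sketch of that lookup, assuming the row bounds are stored as an array of bottom latitudes in ascending order (the names and the adjustment loops are illustrative, not the exact sample code):
// Sketch: locate the grid cell containing a point, given a constant X-axis angle
// and the bottom latitude of each grid row in ascending order.
static int findCellIndex(double lon, double lat, double xMin, double xAngle,
                         int xCount, double[] rowBottomLat) {
  int xIndex = (int) ((lon - xMin) / xAngle);          // straightforward division by cell width
  // Initial Y guess from the average row height, then adjust for varying cell heights
  double avgHeight = (rowBottomLat[rowBottomLat.length - 1] - rowBottomLat[0])
      / Math.max(1, rowBottomLat.length - 1);
  int yIndex = (int) ((lat - rowBottomLat[0]) / avgHeight);
  yIndex = Math.max(0, Math.min(yIndex, rowBottomLat.length - 1));
  while (yIndex > 0 && lat < rowBottomLat[yIndex]) yIndex--;                              // overshoot
  while (yIndex < rowBottomLat.length - 1 && lat >= rowBottomLat[yIndex + 1]) yIndex++;   // undershoot
  return xIndex + xCount * yIndex;
}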
Inferred trips were calculated as follows. The GPS units in the cars transmit a point of position data about every 30 seconds (with some variation) while the car is on. For successive positions of the same car, the code looks for lapses in the position points. For each lapse of more than 15 minutes, it interprets the position before the lapse as the destination of a trip, and the position after the lapse as the origin of a new trip. Also, the last position of the car in the day is interpreted as the destination of a trip (except in the case of a lone position point).
This trip-inference computation relies on the fact that the car GPS unit does not transmit data when the car is off. For trip discovery on data from GPS units that continue transmitting while the vehicle is off, it would additionally be necessary to check whether the car has in fact moved more than a threshold roaming distance (using GeometryEngine.geodesicDistanceOnWGS84).
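A condensed sketch of the lapse-based trip inference described above (the Position and Trip classes here are illustrative stand-ins for the sample's own record types):
import java.util.ArrayList;
import java.util.List;

// Sketch: split one car's positions for one day (sorted by time) into trips wherever
// a lapse exceeds 15 minutes; the last position of the day closes the final trip.
class TripInferenceSketch {
  static final long LAPSE_MILLIS = 15 * 60 * 1000L;

  static class Position { long timeMillis; double lon; double lat; }
  static class Trip { Position origin, destination;
    Trip(Position o, Position d) { origin = o; destination = d; } }

  static List<Trip> inferTrips(List<Position> positions) {
    List<Trip> trips = new ArrayList<>();
    if (positions.size() < 2) return trips;            // a lone point yields no trip
    Position origin = positions.get(0);
    Position prev = origin;
    for (int i = 1; i < positions.size(); i++) {
      Position cur = positions.get(i);
      if (cur.timeMillis - prev.timeMillis > LAPSE_MILLIS) {
        if (prev != origin) trips.add(new Trip(origin, prev)); // point before the lapse ends a trip
        origin = cur;                                          // point after the lapse starts a new one
      }
      prev = cur;
    }
    if (prev != origin) trips.add(new Trip(origin, prev));     // last position of the day ends a trip
    return trips;
  }
}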
The input for the second-stage MapReduce job was the output of the first stage, namely the longitude, latitude, and grid-cell bounds of both the origin and destination of each trip. The second-stage MapReduce job grouped the inferred trips by origin cell in the mapper. Then, for each origin cell, the reducer counted trips by the grid cell containing the destination of the trip, to determine - for that origin cell - the number of trips ending in the most-common destination cell.
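In outline, the second-stage reducer could look like the following sketch, assuming the destination cell identifier arrives as the value for each origin-cell key (the actual sample's record layout and output fields may differ):
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: for one origin cell, count trips per destination cell and emit the total
// trip count plus the count and ID of the most-common destination cell.
public class TripInCommonReducerSketch extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text originCell, Iterable<Text> destCells, Context context)
      throws IOException, InterruptedException {
    Map<String, Integer> counts = new HashMap<>();
    int total = 0;
    for (Text dest : destCells) {
      counts.merge(dest.toString(), 1, Integer::sum);
      total++;
    }
    Map.Entry<String, Integer> top =
        Collections.max(counts.entrySet(), Map.Entry.comparingByValue());
    context.write(originCell, new Text(total + "\t" + top.getValue() + "\t" + top.getKey()));
  }
}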
To execute the calculations, the MapReduce jobs can be invoked by using either the Execute Workflow tool in ArcMap, or the command line. Here are recipes for invoking the MapReduce jobs from the command line:
env HADOOP_CLASSPATH=../lib/esri-geometry-api.jar:../lib/spatial-sdk-hadoop.jar \
hadoop jar trip-discovery.jar com.esri.hadoop.examples.trip.TripCellDriver \
-libjars ../lib/esri-geometry-api.jar,../lib/spatial-sdk-hadoop.jar \
15 500 trips/japan-tokyo-gcs.json trips/vehicle-positions.csv \
out-trips-inferred
env HADOOP_CLASSPATH=../lib/esri-geometry-api.jar \
hadoop jar trip-discovery.jar com.esri.hadoop.examples.trip.TripInCommonDriver \
-libjars ../lib/esri-geometry-api.jar \
1 'out-trips-inferred/part-r-*' out-trips-by-origin-cell
The tab-separated output was converted to JSON - a format suitable for import to ArcMap - with Hive queries using ST_Geometry functions from GIS Tools for Hadoop.
create external table trips_by_origin_cell(leftlon double, botlat double, rightlon double, toplat double,
totnum int, samedest int, pct double,
destlhs double, destbot double, destrhs double, desttop double)
row format delimited fields terminated by '\t'
location '/user/rwhitman/out-trips-by-origin-cell';
create external table trip_origin_json (totnum int, samedest int, pct double, destlhs double, destbot double,
destrhs double, desttop double, shape binary)
row format serde 'com.esri.hadoop.hive.serde.JsonSerde' -- with v2, use EsriJsonSerDe
stored as inputformat 'com.esri.json.hadoop.UnenclosedJsonInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/user/rwhitman/trip-origin-json';
insert overwrite table trip_origin_json select totnum, samedest, pct, destlhs, destbot, destrhs, desttop,
ST_Polygon(leftlon,botlat, rightlon,botlat, rightlon,toplat, leftlon,toplat) from trips_by_origin_cell;
Next, the JSON file of results was copied from HDFS, to be accessible to ArcMap. This can be done either with command-line hadoop fs (see the example after the parameter list below), or by using the Copy From HDFS tool.
The tool expects the following parameters:
- The HDFS parameters for hostname, port, and username - these relate to the configuration of Hadoop - same as with the Copy To HDFS tool.
- The HDFS remote file is the directory in HDFS generated by the previous step; it contains a JSON file.
- The Output local file parameter is a filename of our choosing, with .json extension.
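If using the command line rather than the Copy From HDFS tool, the result directory written by Hive can be merged into a single local file, for example (the HDFS path matches the Hive table location above; the local filename is of our choosing):
hadoop fs -getmerge /user/rwhitman/trip-origin-json trip-origin.json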
Then the results were imported to ArcMap as a feature class, by using the JSON To Features tool.
The Input JSON is the file copied over in the previous step. The Output feature class is a new ArcGIS feature class, with a name of our choosing. For JSON type, be sure to select UNENCLOSED_JSON. As the unenclosed JSON that was imported does not carry spatial-reference information, it is necessary to right-click the newly imported feature class in the Catalog, then choose Properties, XY Coordinate System, Geographic Coordinate Systems, Asia, Tokyo.
Finally, we used ArcMap to visualize the results in a map. As the JSON that was imported had all fields as text (rather than numbers), it was necessary to add an integer field for the count of trips from each origin cell that ended in the most-common destination cell - which the ArcMap field calculator populated handily (from the text field that contained the number as character digits). Then it was possible to use this integer field for symbols varying by quantity. The map shows cells that were the origin of five or more trips to a common destination cell. Cells that were the origin of 11 or more trips to a common destination are symbolized with bigger and darker purple squares, to highlight candidate areas for carpool suggestions.
A potential further study could additionally consider the following:
- The location and distance of the most-common destination cell, for each origin cell of interest;
- Time slices, as many people need to go to their destination during a specific part of the day.
The following open-source projects from GIS Tools for Hadoop on GitHub were used:
- The geometry-api-java library was used for computing equal-area grid cells.
- From spatial-framework-for-hadoop, Hive UDFs were used to create geometries from raw data, and API utilities were used to import and export data from and to JSON.
- Geoprocessing tools were used from geoprocessing-tools-for-hadoop to import results into ArcMap for visualization, as well as to export a polygon of Japan for gridding.
The MapReduce calculations ran in under an hour on a single-node development instance of Hadoop. This provides a proof of concept of using a Hadoop cluster to run similar calculations on much bigger data sets.
The complete source code is available on GitHub.