Vehicle Trip Discovery with GIS Tools for Hadoop

An interesting task in highway management is to study the potential impact of driver carpooling, based on analysis of automatically collected automobile GPS position data. To identify potential enhancements to carpool participation, we set out to find the places with the highest numbers of trips having similar origin and destination locations. The source data for this experiment consists of nearly 40 million vehicle position records assembled from a single day of GPS-collected vehicle positions. Each position record consists of longitude and latitude along with date, time, and speed. We used the Hadoop MapReduce framework for distributed parallel computation, in view of its capability for analyzing larger data sets.

Study Area

The study area was the country of Japan, selected from World Map in the data included with ArcGIS. It was projected to the Tokyo Geographic Coordinate System with the Project geoprocessing tool in ArcMap, to match the spatial reference of the GPS data.

[GP Tool: Project to Tokyo GCS]

Next, the study area was exported to JSON format, suitable for use in computations on Hadoop, by using the Features To JSON tool.

[GP Tool: Features To JSON]

The input feature class is the country of Japan in Tokyo GCS, as produced in the previous step. The output JSON parameter is a filename of our choosing, with .json extension. For the remaining parameters, the defaults are suitable: JSON type (ENCLOSED_JSON) and Formatted JSON (unchecked).

Then the JSON file representing the features was copied to HDFS, to be accessible to the computations on Hadoop. This can be done either with command-line hadoop fs, or by using the Copy To HDFS tool.

[GP Tool: Copy To HDFS]

The tool expects the following parameters:

  • The Input local file is the JSON file output by the previous step.
  • The HDFS parameters for hostname, port, and username relate to the configuration of Hadoop; values can be provided by the system administrator (the default port number should work unless the admin has configured a non-default port).
  • The HDFS remote file is the path and file name to which the tool will copy the file.

Calculations were done in two stages of MapReduce applications on Hadoop. As a cursory overview of MapReduce: first a Mapper emits key-value pairs, and the framework groups the values by key; then, for each key, a Reducer performs a calculation on that key's values and emits any output records for that key. The first MapReduce application used in our study creates a grid of cells covering Japan, infers the origin and destination of trips from sequences of vehicle positions, and identifies the grid cell containing the origin and destination point of each inferred trip.

Read Input CSV

The columns of the CSV file of GPS positions include vehicle ID, date, time, position as longitude and latitude (degrees-minutes-seconds), and speed (km/h). The mapper of the first-stage MapReduce application reads the input CSV data and treats the combination of car ID and date as the key, with the remainder of the input fields as the value associated with that key. Thus the data is grouped by car and date, passing to the reducer a separate list of position records for each car and date.

class TripCellMapper
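
As a rough illustration, a mapper of this shape might look like the following minimal sketch; the CSV field order, the delimiter, and the class details are assumptions for illustration, not the project's actual code.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: groups positions by (car ID, date) for the reducer.
public class TripCellMapperSketch extends Mapper<LongWritable, Text, Text, Text> {
  private final Text outKey = new Text();
  private final Text outValue = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumed CSV layout: carID,date,time,longitude,latitude,speed
    String[] fields = line.toString().split(",");
    if (fields.length < 6)
      return;  // skip malformed records
    // Key on car ID plus date, so each reduce call sees one car-day
    outKey.set(fields[0] + "," + fields[1]);
    // The remaining fields (time, position, speed) form the value
    outValue.set(fields[2] + "," + fields[3] + "," + fields[4] + "," + fields[5]);
    context.write(outKey, outValue);
  }
}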

In order to support trips that span midnight, when multiple days of data are available, a mapper could use a key consisting of the car ID only, passing to the reducer a potentially longer list of position records, which would need to be sorted by date and time.

Create grid

The grid of cells to cover Japan is calculated in the setup of the reducer of the first MapReduce application, once per reducer node. If we had simply used an equal-angle grid with the geographic coordinates, the geodesic area of a cell would have differed by about 18% between northernmost and southernmost Japan, so we generate an equal-area grid while still using geographic coordinates. To produce an equal-area grid, the code takes the envelope of the study area and uses the middle latitude as a baseline. It uses GeometryEngine.geodesicDistanceOnWGS84 to calculate the length of a 1° arc along that latitude (using WGS84 as a close approximation of distance, not position, on Tokyo GCS), then uses proportions to find the X-axis angle corresponding to the desired cell width. With a constant X-axis angle, it calculates variable Y-axis grid angles, starting at the southernmost latitude and working northward: at each row it applies GeometryEngine.geodesicDistanceOnWGS84 to the X-axis cell angle to get the cell width in meters at that latitude, and divides that into the constant target area to get the Y-axis angle for the row.

TripCellReducer#buildGrid
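
A minimal sketch of this construction follows; the method shape, the cell-size parameter, and the conversion of the row height back to degrees are illustrative assumptions, not the actual buildGrid code.

import com.esri.core.geometry.GeometryEngine;
import com.esri.core.geometry.Point;
import java.util.ArrayList;
import java.util.List;

// Sketch only: returns the northern latitude of each grid row, south to north.
class EqualAreaGridSketch {
  static List<Double> buildRowLatitudes(double xmin, double ymin, double ymax,
                                        double cellSizeMeters) {
    // Length in meters of a 1-degree arc of longitude at the middle latitude
    double midLat = (ymin + ymax) / 2;
    double metersPerDegreeX = GeometryEngine.geodesicDistanceOnWGS84(
        new Point(xmin, midLat), new Point(xmin + 1, midLat));
    // Constant X-axis cell angle, found by proportion
    double xAngle = cellSizeMeters / metersPerDegreeX;
    double targetArea = cellSizeMeters * cellSizeMeters;

    // Work northward from the southernmost latitude, computing a variable
    // Y-axis angle per row so that every cell has roughly the same area
    List<Double> rowTops = new ArrayList<>();
    double lat = ymin;
    while (lat < ymax) {
      // Actual width in meters of one cell at this row's latitude
      double widthMeters = GeometryEngine.geodesicDistanceOnWGS84(
          new Point(xmin, lat), new Point(xmin + xAngle, lat));
      // Cell height in meters that keeps the cell area constant
      double heightMeters = targetArea / widthMeters;
      // Convert that height to a Y-axis angle via meters per degree latitude
      double metersPerDegreeY = GeometryEngine.geodesicDistanceOnWGS84(
          new Point(xmin, lat), new Point(xmin, lat + 1));
      lat += heightMeters / metersPerDegreeY;
      rowTops.add(lat);
    }
    return rowTops;
  }
}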

Cell containing location point

Finding the cell containing a location point, when the cells are stored as an array of bounds and each row has the same number of cells, proceeds as follows. The X-axis index is a straightforward division by cell width. The Y-axis calculation is adjusted for possible overshoot or undershoot due to the varying cell height. Then cellIndex = xIndex + xCount * yIndex.

TripCellReducer#queryGrid
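
As a rough sketch, reusing the assumed fields from the grid sketch above (the actual queryGrid implementation may differ):

// Sketch only: locate the grid cell containing a longitude/latitude point.
class GridQuerySketch {
  double xmin, ymin;  // envelope origin (southwest corner), assumed fields
  double xAngle;      // constant X-axis cell angle, in degrees
  double[] rowTop;    // northern latitude of each row, south to north
  int xCount;         // number of cells per row

  int queryGrid(double lon, double lat) {
    int xIndex = (int) ((lon - xmin) / xAngle);  // straightforward division
    // First guess at the row from the average row height, then adjust for
    // the varying cell heights (possible overshoot or undershoot)
    int rowCount = rowTop.length;
    double avgYAngle = (rowTop[rowCount - 1] - ymin) / rowCount;
    int yIndex = (int) ((lat - ymin) / avgYAngle);
    if (yIndex >= rowCount)
      yIndex = rowCount - 1;
    while (yIndex > 0 && lat < rowTop[yIndex - 1])
      yIndex--;  // overshoot: point is south of this row's bottom edge
    while (yIndex < rowCount - 1 && lat >= rowTop[yIndex])
      yIndex++;  // undershoot: point is north of this row's top edge
    return xIndex + xCount * yIndex;  // cellIndex = xIndex + xCount * yIndex
  }
}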

Discover trips: origin and destination

Inferred trips were calculated as follows. The GPS units in the cars transmit a point of position data about every 30 seconds (with some variation) while the car is on. For successive positions of the same car, the code looks for lapses in the position points. For each lapse of more than 15 minutes, it interprets the position before the lapse as the destination of a trip, and the position after the lapse as the origin of a new trip. Also, the last position of the car in the day is interpreted as the destination of a trip (except in the case of a lone position point).

TripCellReducer#reduce
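
Here is a minimal sketch of the lapse-based inference; the Position and Trip types and the exact boundary handling are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;

// Sketch only: split one car-day of positions into trips at 15-minute lapses.
class TripInferenceSketch {
  static final long LAPSE_MS = 15 * 60 * 1000;  // 15-minute lapse threshold

  static class Position { long timeMillis; double lon, lat; }
  static class Trip {
    final Position origin, destination;
    Trip(Position o, Position d) { origin = o; destination = d; }
  }

  /** positions: one car's points for one day, sorted by time. */
  static List<Trip> inferTrips(List<Position> positions) {
    List<Trip> trips = new ArrayList<>();
    if (positions.size() < 2)
      return trips;  // a lone position point yields no trip
    Position origin = positions.get(0);
    Position prev = origin;
    for (int i = 1; i < positions.size(); i++) {
      Position cur = positions.get(i);
      if (cur.timeMillis - prev.timeMillis > LAPSE_MS) {
        // The point before the lapse ends a trip; the point after starts one
        trips.add(new Trip(origin, prev));
        origin = cur;
      }
      prev = cur;
    }
    // The car's last position of the day ends the final trip
    if (prev != origin)
      trips.add(new Trip(origin, prev));
    return trips;
  }
}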

This trip-inference computation relies on the fact that the car GPS unit does not transmit data when the car is off. For trip discovery on data from GPS units that continue transmitting while the vehicle is off, it would be necessary to additionally check whether the car has in fact moved more than a threshold roaming distance (using GeometryEngine.geodesicDistanceOnWGS84), as sketched below.
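
For instance, the lapse test in the sketch above could be supplemented with a movement test; the threshold value here is a hypothetical placeholder.

import com.esri.core.geometry.GeometryEngine;
import com.esri.core.geometry.Point;

// Sketch only: a stretch of points with no movement beyond the threshold
// would mark a stopped car, and hence a trip boundary.
class RoamingCheckSketch {
  static final double ROAMING_THRESHOLD_METERS = 100;  // assumed threshold

  /** True if the car moved more than the roaming threshold between points. */
  static boolean moved(double lon1, double lat1, double lon2, double lat2) {
    return GeometryEngine.geodesicDistanceOnWGS84(
        new Point(lon1, lat1), new Point(lon2, lat2))
        > ROAMING_THRESHOLD_METERS;
  }
}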

Common destination of trips from common origin

The input for the second-stage MapReduce job was the output of the first stage, namely the longitude, latitude, and grid-cell bounds of both the origin and destination of each trip. The second-stage MapReduce job grouped the inferred trips by origin cell, in the mapper. Then, for each origin cell, the reducer counted trips by the grid cell containing the destination of the trip, to determine, for that origin cell, the number of trips ending in the most-common destination cell.

TripInCommonReducer#reduce
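
A minimal sketch of such a reducer follows; the value layout and the extractDestinationCell helper are hypothetical, not the project's actual code.

import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: counts, per origin cell, the trips by destination cell.
public class TripInCommonReducerSketch extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text originCell, Iterable<Text> trips, Context context)
      throws IOException, InterruptedException {
    Map<String, Integer> countByDestCell = new HashMap<>();
    int total = 0;
    for (Text trip : trips) {
      String destCell = extractDestinationCell(trip.toString());
      countByDestCell.merge(destCell, 1, Integer::sum);
      total++;
    }
    // The most-common destination cell for this origin cell
    Map.Entry<String, Integer> top = Collections.max(
        countByDestCell.entrySet(), Map.Entry.comparingByValue());
    context.write(originCell,
        new Text(total + "\t" + top.getValue() + "\t" + top.getKey()));
  }

  /** Hypothetical helper: pulls the destination-cell bounds from a record. */
  private static String extractDestinationCell(String record) {
    String[] fields = record.split("\t");
    // Assumed layout: destination-cell bounds occupy the trailing field
    return fields[fields.length - 1];
  }
}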

Execute MapReduce Jobs

To execute the calculations, the MapReduce jobs can be invoked either with the Execute Workflow tool in ArcMap, or from the command line. Here are recipes for invoking the MapReduce jobs from the command line:

env HADOOP_CLASSPATH=../lib/esri-geometry-api.jar:../lib/spatial-sdk-hadoop.jar \
  hadoop jar trip-discovery.jar com.esri.hadoop.examples.trip.TripCellDriver \
  -libjars ../lib/esri-geometry-api.jar,../lib/spatial-sdk-hadoop.jar \
  15 500 trips/japan-tokyo-gcs.json trips/vehicle-positions.csv \
  out-trips-inferred

env HADOOP_CLASSPATH=../lib/esri-geometry-api.jar \
  hadoop jar trip-discovery.jar com.esri.hadoop.examples.trip.TripInCommonDriver \
  -libjars ../lib/esri-geometry-api.jar \
  1 'out-trips-inferred/part-r-*' out-trips-by-origin-cell

Import into ArcMap and visualize

The tab-separated output was converted to JSON, a format suitable for import into ArcMap, with Hive queries using the ST_Geometry functions from GIS Tools for Hadoop.

create external table trips_by_origin_cell(leftlon double, botlat double, rightlon double, toplat double,
                                           totnum int, samedest int, pct double,
                                           destlhs double, destbot double, destrhs double, desttop double)
row format delimited fields terminated by '\t'
location '/user/rwhitman/out-trips-by-origin-cell';

create external table trip_origin_json  (totnum int, samedest int, pct double, destlhs double, destbot double,
                                         destrhs double, desttop double, shape binary)
row format serde 'com.esri.hadoop.hive.serde.JsonSerde'   -- with v2, use EsriJsonSerDe
stored as inputformat 'com.esri.json.hadoop.UnenclosedJsonInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/user/rwhitman/trip-origin-json';

insert overwrite table trip_origin_json select totnum, samedest, pct, destlhs, destbot, destrhs, desttop,
  ST_Polygon(leftlon,botlat, rightlon,botlat, rightlon,toplat, leftlon,toplat) from trips_by_origin_cell;

Next, the JSON file of results was copied from HDFS, to be accessible to ArcMap. This can be done either with command-line hadoop fs, or by using the Copy From HDFS tool.

[GP Tool: Copy From HDFS]

The tool expects the following parameters:

  • The HDFS parameters for hostname, port, and username relate to the configuration of Hadoop, the same as with the Copy To HDFS tool.
  • The HDFS remote file is the directory in HDFS generated by the previous step; it contains a JSON file.
  • The Output local file parameter is a filename of our choosing, with .json extension.

Then the results were imported to ArcMap as a feature class, by using the JSON To Features tool. [GP tool: JSON To Features]

The Input JSON is the file copied over in the previous step. The Output feature class is a new ArcGIS feature class, with a name of our choosing. For JSON type, be sure to select UNENCLOSED_JSON. As the unenclosed JSON that was imported does not carry information as to spatial reference, it is necessary to right-click the newly imported feature class in the Catalog, then choose Properties, XY Coordinate System, Geographic Coordinate Systems, Asia, Tokyo.

[Feature Class Properties: Spatial Reference]

Finally, we used ArcMap to visualize the results in a map. As the JSON that was imported had all fields as text (rather than number), it was necessary to add an integer field for the count of trips from each origin cell that ended in the most-common destination cell, which the ArcMap field calculator populated handily (from the text field containing the number as character digits). Then it was possible to use this integer-value field for symbols varying by quantity. The map shows cells that were the origin of five or more trips to a common destination cell. Cells that were the origin of 11 or more trips to a common destination are symbolized with bigger and darker purple squares, to highlight candidate areas for carpool suggestions.

Map: by origin cell, count of car trips to common destination cell

Conclusions

A potential further study could additionally consider the following:

  • The location and distance of the most-common destination cell, for each origin cell of interest;
  • Time slices, as many people need to go to their destination during a specific part of the day.

The following open-source projects from GIS Tools for Hadoop on GitHub were used:

  • Esri Geometry API (esri-geometry-api.jar)
  • Spatial Framework for Hadoop (spatial-sdk-hadoop.jar), including the Hive ST_Geometry functions and JSON SerDe
  • Geoprocessing Tools for Hadoop (Features To JSON, Copy To HDFS, Copy From HDFS, JSON To Features, and Execute Workflow)

The MapReduce calculations ran in under an hour on a single-node development instance of Hadoop. This provides a proof of concept of using a Hadoop cluster to run similar calculations on much bigger data sets.

The complete source code is available on GitHub.