In Part 2 I modified the SSIS data flow to add a column containing the position of each trackpoint read from the GPX file in the SQL Server spatial type geography . Part three investigates SQL spatial types further and walks through a use case to build on the existing design: reverse geocoding. This will be useful in our final BI solution as it provides another dimension for analysis.
What is reverse geocoding?
Reverse geocoding is the process of turning a set of co-ordinates back into a readable place name. Depending on the granularity of data this could be a street name, town or city, country, continent, etc.
There are several ways to reverse geocode data: the most popular and simplest is using an map provider’s API, such as Bing or Google. They are fast and accurate and also extremely easy to use. But what’s the fun in that?
I want to roll my own for a couple of reasons: a) it’s a good way of learning about spatial data in SQL Server and b) an average day’s GPX contains roughly 3500 points. Multiply that by the 110 days I was cycling, that’s a lot of API calls. Most map providers give a limited number of API calls and that number’s not 385,000.
To reverse geocode the co-ordinates we need a dataset containing the shapes queried when the question is asked “In which part of the world are these co-ordinates located?”.
There are lots of options with differing levels of detail, some free and many paid for. As my cycling was limited to Europe and I’m only interested in my location down to the region level, I’ll be using the free NUTS dataset published by the EU, in particular the 1:3 million shapefile (NUTS_2010_03M_SH.zip) [download].
The NUTS dataset comprises of 4 levels (0,1,2,3) ranging from the largest to smallest territorial units. The actual breakdown depends on the country and how it is divided geographically. For instance the entire UK is level 0, level 1 comprises of regions of England, level 2 is counties and level 3 authorities – see wiki for more.
The NUTS zip file contains (amongst other things) a shapefile which needs to be imported into SQL Server. This sounds like it should be an easy task but Microsoft provide zero tooling to help; instead you’ve got to choose between some powerful (and feature rich) applications which cost a fortune or use some freeware/open source tools which are harder to get your head around. As I quit my job to go cycling I think we’ll take option two.
Two free tools are Shape2SQL and ogr2ogr (part of osgeo4w). Shape2SQL has a nice user interface but I couldn’t get it to work with SQL 2014 (clicking ‘Upload to Database’ did nothing). ogr2ogr is extremely powerful but command line based and getting it to do what you want takes a bit of practice and research. Alastair Aitchison has a very useful blog post detailing how to install osgeo4w as well as a number of ogr2ogr recipes.
Importing the NUTS shapefile into SQL Server
Open up the osgeo4w shell. This sets the correct environment variables to allow ogr2ogr to run.
ogr2ogr -f "MSSQLSpatial" "MSSQL:server=localhost;database=Nuts2008;trusted_connection=yes" "NUTS_2010_03M_SH\Data\NUTS_RG_03M_2010.shp" -a_srs "EPSG:4258" -t_SRS "EPSG:4326" -overwrite -lco "GEOM_TYPE=geography" -nln "Nuts2008_Geog" -lco "GEOM_NAME=shape" -progress
The command contains the following parameters:
- -f “MSSQLSpatial” “connection_string” : output format as Microsoft SQL Spatial data with target database connection string
- “NUTS_2010_03M_SH\Data\NUTS_RG_03M_2010.shp” : location of the NUTS shapefile
- -a_srs “EPSG:4258” : source data EPSG is 4258
- -t_srs “EPSG:4326” : target data EPSG is 4326 (to match the EPSG of the geography data from our GPX file – this is the 4326 used as the third parameter to SqlGeography.Point in Part 1)
- -overwrite : overwrite existing table data
- -lco “GEOM_TYPE=geography” : use SQL Server’s geography data type (instead of geometry, the default)
- -nln “Nuts2008_Geog” : table name to store the imported shapefiles
- -lco “GEOM_NAME=shape” : column name of geography data
- -progress : indicate progress
The data should now be in the destination table but it won’t map properly in management studio due to the way some of the polygons are stored:
I don’t fully understand why but after several hours of research I figured out there are a couple more steps required to make the data display correctly. Run the following commands in the Nuts2008 database:
--make valid all geography update Nuts2008_Geog set shape=shape.MakeValid() --fix polygon issues (NOTE: run this twice!) update Nuts2008_Geog set shape=shape.ReorientObject() where shape.EnvelopeAngle() > 90
Now the shapes will be visible on the map surface in the spatial viewer tab. The data is now ready for use in the reverse geocoding process.
Reverse geocoding imported GPX points
The geography data type’s STIntersects method can be used to determine whether 2 geography instances intersect. In our case, for each imported GPX point we want to determine which shape (region) it intersects. The name of this shape will give us the name of the region where the point is located.
A spatial index makes the lookup much faster, I chose the GEOGRAPHY_AUTO_GRID option. It can be tuned further but for now the query’s fast enough for my requirements.
CREATE SPATIAL INDEX [IDX_SI_NUTS] ON [dbo].[Nuts2008_Geog] ([shape])USING GEOGRAPHY_AUTO_GRID WITH ( CELLS_PER_OBJECT = 16) GO
The query to reverse geocode each point from the GPX is as follows:
--Lookup each point select gpxdata.lat, gpxdata.lon, gpxdata.loggedtime, nutsgeog.nuts_id, nutsgeog.stat_levl_, nutsgeog.shape from GarminGPX.dbo.GPXData gpxdata, Nuts2008_Geog nutsgeog where stat_levl_ = 3 -- only interested in NUTS level 3 (the smallest areas) and gpxdata.location.STIntersects(nutsgeog.shape) = 1 order by loggedtime
Now looking up the place name manually isn’t that useful when you’re talking about almost 400,000 points. There must be a better way! And there is: I imported all the regions into a SQL Server table (source) then denormalized the data. For any level 3 region I can find out which level 2/1/0 regions it belongs to (import script available here: Nuts2008_Level3.sql).
Joining this new table with the nuts_id column from the Nuts2008_Geog table completes the reverse geocoding process.
select gpxdata.lat, gpxdata.lon, gpxdata.loggedtime, nutsgeog.nuts_id, nutslevel3.Level3Name, nutslevel3.Level2Name, nutslevel3.Level1Name, nutslevel3.Level0Name from GarminGPX.dbo.GPXData gpxdata, Nuts2008_Geog nutsgeog WITH( INDEX([IDX_SI_NUTS])) -- index hint as the SQL optimiser doesn't always appear to use the spatial index join Nuts2008_Level3 nutslevel3 on nutsgeog.nuts_id=nutslevel3.Level3Code where gpxdata.location.STIntersects(nutsgeog.shape) = 1 and nutsgeog.stat_levl_ = 3 order by loggedtime asc
In my sample GPX file, chosen because I spent the day border hopping, I passed back and forth through different regions of Germany and Switzerland:
An updated SSIS project with the script component adding the geography column (from part 2) and sample GPX file is available to download here: Garmin-GPX-to-SQLServer-part3.zip.
We’ve figured out how to reverse-geocode each point in SQL Server so in part 4 we’ll be extending the workflow to include the additional metadata.