Bike Tour Business Intelligence: Part 5 [Dimensions]

Our star schema has three dimensions – Date, Time and Location. This post will cover populating the dimension tables in preparation for cube generation.

Date Dimension


There are lots of date dimension generation scripts online; Mike Davies’ script is good, and I modified it to remove some columns that aren’t useful. The table is created and populated using an Execute SQL Task.
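As a flavour of what such a script does, here’s a heavily cut-down sketch (the table, columns and date range are illustrative, not Mike Davies’ originals):

    -- Minimal, illustrative date dimension
    CREATE TABLE dbo.DimDate (
        DateKey     int        NOT NULL PRIMARY KEY,  -- e.g. 20140710
        FullDate    date       NOT NULL,
        [Day]       tinyint    NOT NULL,
        [Month]     tinyint    NOT NULL,
        [MonthName] varchar(9) NOT NULL,
        [Year]      smallint   NOT NULL,
        [DayName]   varchar(9) NOT NULL
    );

    DECLARE @d date = '2014-01-01';  -- the range is an assumption
    WHILE @d <= '2014-12-31'
    BEGIN
        INSERT INTO dbo.DimDate
        VALUES (CONVERT(int, CONVERT(char(8), @d, 112)), @d,
                DAY(@d), MONTH(@d), DATENAME(month, @d),
                YEAR(@d), DATENAME(weekday, @d));
        SET @d = DATEADD(day, 1, @d);
    END;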

Time Dimension


Another Execute SQL Task runs a second script by Mike Davies, modified to use the SQL Server time data type for the Time column.
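Again purely as an illustration (not the original script), a time dimension at one-second grain can be generated like this:

    -- Illustrative time dimension: one row per second of the day
    CREATE TABLE dbo.DimTime (
        TimeKey  int     NOT NULL PRIMARY KEY,  -- seconds since midnight
        [Time]   time(0) NOT NULL,
        [Hour]   tinyint NOT NULL,
        [Minute] tinyint NOT NULL,
        [Second] tinyint NOT NULL
    );

    ;WITH n AS (
        SELECT TOP (86400)
               CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS int) AS s
        FROM sys.all_objects a CROSS JOIN sys.all_objects b
    )
    INSERT INTO dbo.DimTime
    SELECT s, CONVERT(time(0), DATEADD(second, s, 0)),
           s / 3600, (s / 60) % 60, s % 60
    FROM n;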

Both tasks are kept in a Sequence Container which runs after the staging table generation flow.


Location Dimension


The Location dimension is based on the NUTS dataset, storing the shape data and the denormalised location hierarchy for each area. The overall flow is slightly more complex than for the date and time dimensions: first the DimLocation table is created, then populated, and finally some indexes are built.

DimLocation Container Design

Populating the DimLocation table is a straightforward Extract-Load Data Flow Task running a query against the Nuts2008_Geog and Nuts2008_Level3 tables we imported into SQL Server in Part 3. The data source is a SQL query which retrieves the shape data alongside the NUTS Level 3/2/1/0 region codes and names, and the results are saved to an OLE DB Destination pointing to the BI database.

The SQL query to get the shapes and location data is along these lines (the Nuts2008_Level3 column names are assumptions):
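    -- Shape plus the NUTS level 3/2/1/0 codes and names
    SELECT g.shape,
           l3.nuts_id     AS Level3Code, l3.level3_name AS Level3Name,
           l3.level2_code AS Level2Code, l3.level2_name AS Level2Name,
           l3.level1_code AS Level1Code, l3.level1_name AS Level1Name,
           l3.level0_code AS CountryCode, l3.level0_name AS CountryName
    FROM   Nuts2008_Geog g
    JOIN   Nuts2008_Level3 l3 ON l3.nuts_id = g.nuts_id;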


Finally, two indexes are built on the table: the clustered index and a spatial index on the Shape column.
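Something like the following works (the key column name is an assumption); note that SQL Server requires a clustered primary key on the table before a spatial index can be created:

    -- Clustered primary key (required before a spatial index can be built)
    ALTER TABLE dbo.DimLocation
        ADD CONSTRAINT PK_DimLocation PRIMARY KEY CLUSTERED (LocationKey);

    -- Spatial index to speed up STIntersects lookups against Shape
    CREATE SPATIAL INDEX SIdx_DimLocation_Shape
        ON dbo.DimLocation (Shape)
        USING GEOGRAPHY_AUTO_GRID;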


Bike Tour Business Intelligence: Part 4 [Populating a staging table]

A quick recap of the series so far: I’m building a cube using the data recorded by my Garmin Edge GPS during a four-month bike tour across Europe last year. So far we’ve analysed a single GPX file and used SSIS to import it into a SQL Server database, as well as using SQL Server’s spatial data functionality to reverse-geocode the points in the file.

This post describes how to import a series of GPX files, calculate some measures and output the data to a staging table that will be used later to build the fact table. The measures of interest are speed, distance and time; the GPX files themselves don’t contain this data so we’ll be calculating it using a combination of the Derived Column and Script Transformations.

Star Schema Design

Our final star schema comprises three dimension tables (Time, Date and Location) linked to a fact table containing our measures. The fact table is linked to each dimension twice: the calculations relate to two points in space and time, so there is one relationship for the first time/date/location and another for the second.

A design consideration is whether to keep date and time together or split them into two separate dimension tables. The grain of the data held in the fact table is down to the second, so storing date and time in the same table would need (number of days) * 86,400 rows, whereas separating them requires only (number of days) + 86,400 rows. For the roughly 120 days of the tour that’s about 10.4 million rows versus fewer than 87,000, so it’s much more space-efficient to create two dimension tables.


Populating the Staging Table

The GPX_Staging table will be populated by the SSIS workflow, which at a high level performs the following tasks:

  • Create the GPX staging table
  • Import the GPX files (a Foreach Loop over every file in the directory)
  • Clear the per-file GPX data
  • Extract the GPX data from the XML
  • Reverse geocode each point
  • Calculate speed and distance, appending the results to GPX_Staging

Let’s go through each component in turn.

Create GPX Staging Table

This Execute SQL Task just drops and recreates the staging table. Nothing too important.
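The exact definition isn’t important yet; something along these lines works, with columns matching what the data flow produces later (the names are assumptions):

    IF OBJECT_ID('dbo.GPX_Staging') IS NOT NULL
        DROP TABLE dbo.GPX_Staging;

    CREATE TABLE dbo.GPX_Staging (
        first_loggedtime datetime2(7),
        next_loggedtime  datetime2(7),
        first_location   geography,
        next_location    geography,
        nuts_id          varchar(10),    -- NUTS level 3 code from the reverse geocode
        total_distance   numeric(18,6),  -- metres
        duration         int,            -- seconds
        speed            numeric(18,6)   -- km/h
    );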

Import GPX Files

A Foreach Loop Container, looping through all the GPX files in a directory (configurable via Project Parameters). The current file name is saved in a user variable called GPXFileName.

Clear GPX Data

An Execute SQL Task to truncate the GPXData table which acts as a staging table for the data from a single GPX file.

Extract GPX Data

A Data Flow Task to pull out the GPX data from the XML file, add the location column then save it to the GPXData table. We covered this in Parts 1 to 3: the only difference is the GPX file name comes from the GPXFileName variable rather than being hard coded.


Reverse Geocode

An Execute SQL Task which, for each row in the GPXData table, reverse geocodes the location. Running the reverse geocode process at this point is more efficient than doing it later in the data flow, as the location column is duplicated in the next stage.

Performing the lookup depends on the Location dimension table being populated, which I’ll cover in the next post. For now just assume the table exists and has the NUTS dataset loaded in.

The geocode SQL statement uses the STIntersects method of the geography datatype which we first came across in part 3.
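A sketch of the statement (column names assumed):

    -- For each trackpoint, record the NUTS region whose shape contains it
    UPDATE g
    SET    g.nuts_id = n.nuts_id
    FROM   GPXData g
    JOIN   Nuts2008_Geog n
      ON   n.shape.STIntersects(g.location) = 1;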

Calculate Speed and Distance


The first part of this Data Flow Task is to match sequential rows in the GPXData table – these become the start and end records in the speed/distance/time calculation. To achieve this, two OLE DB Sources are used with a Merge Join Transformation to bring them together. The two data sources use the ROW_NUMBER() function to produce rows that can then be matched.
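As a sketch (column names assumed), the two source queries differ only in where the numbering starts:

    -- Left-hand source: rows numbered from 1
    SELECT ROW_NUMBER() OVER (ORDER BY loggedtime) AS row_num,
           loggedtime, location, nuts_id
    FROM   GPXData
    ORDER BY row_num;

    -- Right-hand source: numbering starts at 0, so joining on row_num
    -- pairs each point with the one recorded immediately after it
    SELECT ROW_NUMBER() OVER (ORDER BY loggedtime) - 1 AS row_num,
           loggedtime, location, nuts_id
    FROM   GPXData
    ORDER BY row_num;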

For the Merge Join to work the data sources need to be sorted. The ORDER BY clause is present in the SQL statements but there are two extra steps required to tell SSIS this is the case. First, bring up the “Advanced Editor” of one of the OLE DB Sources and set “IsSorted” to true under the Input and Output Properties tab. Next find the row_num column under the Output Columns list and set the SortKeyPosition property to 1. Rinse and repeat for the second OLE DB Source.


The Merge Join Transformation matches identical row_num values between the two data sets. The row_num column on the left hand side runs 1, 2, 3, 4, 5 and so on; on the right it runs 0, 1, 2, 3, 4, 5. In this way each row from the left is matched to the next one chronologically.


The Merge Join is configured to show all columns from the left and right hand sides – the left with the prefix ‘first’ and the right the prefix ‘next’.

Merge Join Transformation design

The SQL geography class contains a function to calculate the distance between the two points in each row. Unfortunately SSIS lags far behind SQL Server and has no native support whatsoever for the geography data type. Remember in part 2 we had to use DT_IMAGE for the geography columns in SSIS? This is rubbish, but at least the CLR is there to come to the rescue!

The Calculate Distance Between Two Points script component takes the first_location and next_location columns and outputs a single column total_distance of type DT_NUMERIC. The script itself is straightforward, converting from BlobColumn to SqlBytes as this is the parameter type for the SqlGeography.Deserialize() method.
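A minimal sketch of the row-processing method, assuming SSIS generates buffer properties named after the columns:

    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Types;

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Rehydrate the DT_IMAGE blobs into SqlGeography instances
        var first = SqlGeography.Deserialize(new SqlBytes(
            Row.firstlocation.GetBlobData(0, (int)Row.firstlocation.Length)));
        var next = SqlGeography.Deserialize(new SqlBytes(
            Row.nextlocation.GetBlobData(0, (int)Row.nextlocation.Length)));

        // STDistance returns the distance between the two points in metres
        Row.totaldistance = (decimal)first.STDistance(next).Value;
    }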

The remaining calculations are much more straightforward using the Derived Columns Transformation. Here we can add the duration and speed (in km/h) as well as a couple of other columns that might be useful later on.
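For example, assuming the column names above, the expressions for duration (in seconds) and speed (in km/h) come out along these lines; note a derived column can’t reference another column created in the same transformation, hence the repeated DATEDIFF:

    duration : DATEDIFF("ss", first_loggedtime, next_loggedtime)
    speed    : (total_distance / 1000) / (DATEDIFF("ss", first_loggedtime, next_loggedtime) / 3600.0)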

Derived column configuration

The final data flow component is the OLE DB Destination where the rows are saved to the GPX_Staging table.

Tying it all together

The workflow is now in place to populate the staging table with the raw data from the GPX file as well as computed columns. In the next post I’ll be creating the Dimension tables and populating the Fact table.


Bike Tour Business Intelligence: Part 3 [Reverse Geocoding]

In Part 2 I modified the SSIS data flow to add a column containing the position of each trackpoint read from the GPX file, using the SQL Server spatial type geography. Part three investigates SQL spatial types further and walks through a use case that builds on the existing design: reverse geocoding. This will be useful in our final BI solution as it provides another dimension for analysis.

What is reverse geocoding?

Reverse geocoding is the process of turning a set of co-ordinates back into a readable place name. Depending on the granularity of data this could be a street name, town or city, country, continent, etc.

There are several ways to reverse geocode data: the most popular and simplest is using a map provider’s API, such as Bing or Google. They are fast and accurate and also extremely easy to use. But what’s the fun in that?

I want to roll my own for a couple of reasons: a) it’s a good way of learning about spatial data in SQL Server and b) an average day’s GPX contains roughly 3,500 points. Multiply that by the 110 days I was cycling and that’s a lot of API calls. Most map providers allow only a limited number of API calls, and that number’s not 385,000.

Geo datasets

To reverse geocode the co-ordinates we need a dataset of shapes that can be queried to answer the question “In which part of the world are these co-ordinates located?”.

There are lots of options with differing levels of detail, some free and many paid for. As my cycling was limited to Europe and I’m only interested in my location down to the region level, I’ll be using the free NUTS dataset published by the EU, in particular the 1:3 million shapefile (NUTS_2010_03M_SH.zip) [download].

The NUTS dataset comprises four levels (0, 1, 2, 3) ranging from the largest to the smallest territorial units. The actual breakdown depends on the country and how it is divided geographically. For instance the entire UK is level 0, level 1 comprises the regions of England, level 2 the counties and level 3 the authorities – see wiki for more.

The NUTS zip file contains (amongst other things) a shapefile which needs to be imported into SQL Server. This sounds like it should be an easy task but Microsoft provide zero tooling to help; instead you’ve got to choose between powerful (and feature-rich) applications which cost a fortune and freeware/open-source tools which are harder to get your head around. As I quit my job to go cycling, I think we’ll take option two.

Two free tools are Shape2SQL and ogr2ogr (part of osgeo4w). Shape2SQL has a nice user interface but I couldn’t get it to work with SQL 2014 (clicking ‘Upload to Database’ did nothing). ogr2ogr is extremely powerful but command line based and getting it to do what you want takes a bit of practice and research. Alastair Aitchison has a very useful blog post detailing how to install osgeo4w as well as a number of ogr2ogr recipes.

Importing the NUTS shapefile into SQL Server

Open up the osgeo4w shell. This sets the correct environment variables to allow ogr2ogr to run.

The command contains the following parameters (a sketch of the assembled command follows the list):

  • -f “MSSQLSpatial” “connection_string” : output format as Microsoft SQL Spatial data with target database connection string
  • “NUTS_2010_03M_SH\Data\NUTS_RG_03M_2010.shp” : location of the NUTS shapefile
  • -a_srs “EPSG:4258” : source data EPSG is 4258
  • -t_srs “EPSG:4326” : target data EPSG is 4326 (to match the EPSG of the geography data from our GPX file – this is the 4326 used as the third parameter to SqlGeography.Point in Part 1)
  • -overwrite : overwrite existing table data
  • -lco “GEOM_TYPE=geography” : use SQL Server’s geography data type (instead of geometry, the default)
  • -nln “Nuts2008_Geog” : table name to store the imported shapefiles
  • -lco “GEOM_NAME=shape” : column name of geography data
  • -progress : indicate progress
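Putting those together, the assembled command looks something like this (the connection string is a placeholder; substitute your own server and database):

    ogr2ogr -f "MSSQLSpatial" "MSSQL:server=localhost;database=Nuts2008;trusted_connection=yes" ^
        "NUTS_2010_03M_SH\Data\NUTS_RG_03M_2010.shp" ^
        -a_srs "EPSG:4258" -t_srs "EPSG:4326" ^
        -overwrite -lco "GEOM_TYPE=geography" -lco "GEOM_NAME=shape" ^
        -nln "Nuts2008_Geog" -progress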

The data should now be in the destination table, but it won’t map properly in Management Studio due to the way some of the polygons are stored.


I don’t fully understand why, but after several hours of research I figured out there are a couple more steps required to make the data display correctly. Run the following commands in the Nuts2008 database (what follows is a sketch of the usual fix: MakeValid() to repair invalid shapes, then ReorientObject() to flip polygons whose ring orientation came through inverted):
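    -- Repair any invalid geography instances produced by the import
    UPDATE Nuts2008_Geog SET shape = shape.MakeValid();

    -- An EnvelopeAngle() greater than 90 degrees indicates a polygon
    -- covering "everything except" the intended area, so flip it
    UPDATE Nuts2008_Geog SET shape = shape.ReorientObject()
    WHERE  shape.EnvelopeAngle() > 90;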

Now the shapes will be visible on the map surface in the spatial viewer tab. The data is now ready for use in the reverse geocoding process.


Reverse geocoding imported GPX points

Almost there!

The geography data type’s STIntersects method can be used to determine whether two geography instances intersect. In our case, for each imported GPX point we want to determine which shape (region) it intersects. The name of this shape gives us the name of the region where the point is located.

A spatial index makes the lookup much faster; I chose the GEOGRAPHY_AUTO_GRID option. It can be tuned further, but for now the query’s fast enough for my requirements.
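Creating it is a one-liner (the index name is my own):

    CREATE SPATIAL INDEX SIdx_Nuts2008_Geog_shape
        ON Nuts2008_Geog (shape)
        USING GEOGRAPHY_AUTO_GRID;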

The query to reverse geocode each point from the GPX is along these lines (column names are assumed from the import):
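    SELECT g.loggedtime, n.nuts_id
    FROM   GPXData g
    JOIN   Nuts2008_Geog n
      ON   n.shape.STIntersects(g.location) = 1
    ORDER BY g.loggedtime;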

In my sample GPX the first few points are in administrative region CH052 which (after a manual lookup) is Schaffhausen in Switzerland.

Now, looking up the place name manually isn’t that useful when you’re talking about almost 400,000 points. There must be a better way! And there is: I imported all the regions into a SQL Server table (source) then denormalised the data. For any level 3 region I can find out which level 2/1/0 regions it belongs to (import script available here: Nuts2008_Level3.sql).

Joining this new table with the nuts_id column from the Nuts2008_Geog table completes the reverse geocoding process.
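Extending the lookup above with that join gives the whole hierarchy in one pass (again, the Nuts2008_Level3 column names are my assumptions):

    SELECT g.loggedtime,
           l3.level0_name AS country,
           l3.level1_name AS level1,
           l3.level2_name AS level2,
           l3.level3_name AS region
    FROM   GPXData g
    JOIN   Nuts2008_Geog n    ON n.shape.STIntersects(g.location) = 1
    JOIN   Nuts2008_Level3 l3 ON l3.nuts_id = n.nuts_id
    ORDER BY g.loggedtime;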

In my sample GPX file, chosen because I spent the day border hopping, I passed back and forth through different regions of Germany and Switzerland.


Example

An updated SSIS project with the script component adding the geography column (from part 2) and sample GPX file is available to download here: Garmin-GPX-to-SQLServer-part3.zip.

Conclusion

We’ve figured out how to reverse-geocode each point in SQL Server so in part 4 we’ll be extending the workflow to include the additional metadata.


Bike Tour Business Intelligence: Part 2 [Geography]

My previous post covered using SSIS to read XML from a GPX file and save the data to a new table via a SQL Server Destination. This post covers extending the SSIS workflow to create spatial data; the next will cover reverse-geocoding using SQL Server.

SQL Spatial

Spatial data has been part of SQL Server since 2008, with the introduction of two new data types: geometry and geography. These data types can be used to store points and shapes, and SQL Server offers a number of built-in functions to reason about them, such as finding the distance between points or whether two shapes intersect. Much easier than implementing the code yourself!

Whether to use geometry or geography depends on what the data is being used for. There’s a good explanation of the difference on the SQL From the Trenches blog – essentially it’s flat earth vs round earth co-ordinate systems. As our data set relates to points on a map then the geography data type’s the right one.

Extending Data Flow to add geography data type

The XML data being imported already contains a latitude and longitude and we can use this to create an extra column in the SSIS data flow – a point of type geography.

This needs a new component to be added to the workflow. After the changes, the finished data flow should look like this:


(the eagle-eyed will notice that the destination type has changed since the previous post. This’ll be explained in due course)

Drag a new Script Component to the data flow and choose “Transformation” type.


Connect the new component to the “Read GPX XML” component then bring up the properties window. Under “Input Columns” select only the lat/lon columns, then under the “Inputs and Outputs” page create a new output column named ‘location’. SSIS doesn’t natively support SQL Server’s geography data type so instead set the data type to ‘image [DT_IMAGE]’.


Bring up the script editor and first add a reference to the CLR assembly which contains the definition for the geography type: Microsoft.SqlServer.Types. For some reason this isn’t in the GAC so you’ll need to browse to the DLL – on my SQL 2014 server it’s C:\Program Files\Microsoft SQL Server\120\Shared\Microsoft.SqlServer.Types.dll.

Next we need to write the code that adds the geography column to each row. Initialising the MemoryStream and BinaryWriter instances in the PreExecute() method avoids unnecessarily creating new instances for each input row and helps the workflow run as quickly as possible.

The Input0_ProcessInputRow() method is called once per row and adds a new instance of SqlGeography to the OutputBuffer.
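A minimal sketch of the two methods, assuming the input columns lat/lon and the output column location:

    using System.IO;
    using Microsoft.SqlServer.Types;

    private MemoryStream memoryStream;
    private BinaryWriter binaryWriter;

    public override void PreExecute()
    {
        base.PreExecute();
        // Created once, reused for every row
        memoryStream = new MemoryStream();
        binaryWriter = new BinaryWriter(memoryStream);
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Build a geography point from the lat/lon columns (SRID 4326)
        var point = SqlGeography.Point((double)Row.lat, (double)Row.lon, 4326);

        // Serialise the point into the DT_IMAGE 'location' output column
        memoryStream.SetLength(0);
        point.Write(binaryWriter);
        Row.location.AddBlobData(memoryStream.ToArray());
    }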

Next delete the original SQL Server Destination and create an OLE DB Destination instead. The reason for this is that using the existing SQL Server Destination with a geography data type throws an error during execution.

Connect the “Add Geography Column” component to the new OLE DB Destination then bring up the properties window. Re-use the existing connection manager. As a new column’s been added, drop the destination table using SSMS then bring up the ‘New…’ window which shows the SQL code used to create a new version of the table. Give it a new name and change the data type of the location column from varbinary(max) to geography.
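The resulting CREATE TABLE will be along these lines (matching the output columns defined in part 1):

    CREATE TABLE GPXData (
        lat        numeric(9,6),
        lon        numeric(9,6),
        ele        int,
        cad        int,
        atemp      int,
        loggedtime datetime2(7),
        location   geography   -- changed from varbinary(max)
    );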

Map the columns in the ‘mappings’ page then save everything and run the workflow. If there are no errors then the extra column will appear when the table is queried in SQL Server.

Tip: Add an ‘Execute SQL task’ to the Control Flow, before the Data Flow Task, to truncate the GPXData table each time the workflow is run.

Viewing the data inside SSMS

Inside the results view in SSMS the location column looks like any other binary data, but when a resultset has a column containing spatial data SSMS adds a new ‘spatial results’ tab which renders all the points/shapes onto a map-like surface for visualisation.

The short track visualised inside SSMS

In the next post we’ll be loading country geodata into SQL Server and writing queries to reverse-geocode our GPX files – in other words providing actual location data (such as country) for each of the points in our GPX file.


Bike Tour Business Intelligence: Part 1 [Reading GPX files with SSIS]

Last year I spent the summer riding my bike across Europe, clocking up 6100km through 7 different countries. Riding through the countryside with the wind blowing through your hair is a great feeling and I had a fantastic time. Now I’m back and, being a data professional, I wanted to analyse the data logged by my Garmin Edge 800 GPS.

The goal of this project is to read the approximately 120 files generated by the GPS into a SQL Server database and use BI tools to see if I can discover anything interesting in the data. This will be a series of posts, starting with extracting the data from a single GPX file and into SQL Server.

Analysing the GPX files

The Garmin Edge creates GPX files (which are XML) following the GPX Exchange Format although with a couple of Garmin-only extensions. When the timer is started the device records a series of data fields every few seconds:

  • Date and Time
  • Location (lat and long)
  • Elevation
  • Temperature*
  • Cadence*

(* From studying a GPX file these fields aren’t always captured)

The data comes from a <trkpt> tag in the GPX, itself a child of the <trkseg> tag.
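For illustration, a single trackpoint looks something like this (values invented; temperature and cadence sit under Garmin’s TrackPointExtension):

    <trkpt lat="47.697632" lon="8.634138">
      <ele>397.0</ele>
      <time>2014-07-10T09:23:21Z</time>
      <extensions>
        <gpxtpx:TrackPointExtension>
          <gpxtpx:atemp>21.0</gpxtpx:atemp>
          <gpxtpx:cad>85</gpxtpx:cad>
        </gpxtpx:TrackPointExtension>
      </extensions>
    </trkpt>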

Something apparent from the data is there’s no mention of speed or distance. We’re going to need to calculate that ourselves later.

Reading the XML in SSIS

The goal of this task is to get SSIS to read the XML from a single GPX file and output it to a SQL Server destination. This approach is straightforward to test and only a simple extension is required to iterate over a series of GPX files.

I couldn’t figure out how to output the trkpt elements using the built-in XML handling in SSIS (it kept sending 0 rows through the pipeline), so let’s use a Script Component and some C# instead.

Create a new ‘file’ connection manager pointing to the GPX file to be read and give it a sensible name (I have chosen ‘gpx-file’). Then create a new Data Flow and add a Script Component, being sure to choose the Source component type, and choose a name. Double click the Script Component to bring up the properties window, then on the ‘Connection Manager’ tab add a new connection manager named ‘conn’, linked to the file connection manager created earlier. Finally add the output columns, one for each piece of data we need:

Column Name    Type
lat            numeric [DT_NUMERIC] – precision 9, scale 6
lon            numeric [DT_NUMERIC] – precision 9, scale 6
ele            four-byte signed integer [DT_I4]
cad            four-byte signed integer [DT_I4]
atemp          four-byte signed integer [DT_I4]
loggedtime     database timestamp with precision [DT_DBTIMESTAMP2]

Bring up the script editor window. The key method is CreateNewOutputRows() which is responsible for adding rows to the output buffer. In our case this involves reading the XML document and running an XPath query to retrieve all <trkpt> elements. For each element the XML is parsed and a saved to a new row in the output buffer.
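A sketch of the method, assuming output columns named as in the table above (error handling omitted):

    using System;
    using System.Globalization;
    using System.Xml;

    public override void CreateNewOutputRows()
    {
        // The 'conn' file connection manager yields the GPX file path
        var path = (string)Connections.conn.AcquireConnection(null);

        var doc = new XmlDocument();
        doc.Load(path);

        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("gpx", "http://www.topografix.com/GPX/1/1");
        ns.AddNamespace("gpxtpx", "http://www.garmin.com/xmlschemas/TrackPointExtension/v1");

        // One output row per <trkpt> element
        foreach (XmlNode trkpt in doc.SelectNodes("//gpx:trkseg/gpx:trkpt", ns))
        {
            Output0Buffer.AddRow();
            Output0Buffer.lat = decimal.Parse(trkpt.Attributes["lat"].Value, CultureInfo.InvariantCulture);
            Output0Buffer.lon = decimal.Parse(trkpt.Attributes["lon"].Value, CultureInfo.InvariantCulture);
            Output0Buffer.ele = (int)double.Parse(trkpt.SelectSingleNode("gpx:ele", ns).InnerText, CultureInfo.InvariantCulture);
            Output0Buffer.loggedtime = DateTime.Parse(trkpt.SelectSingleNode("gpx:time", ns).InnerText,
                CultureInfo.InvariantCulture, DateTimeStyles.AdjustToUniversal);

            // Temperature and cadence aren't always captured
            var atemp = trkpt.SelectSingleNode(".//gpxtpx:atemp", ns);
            if (atemp != null) Output0Buffer.atemp = (int)double.Parse(atemp.InnerText, CultureInfo.InvariantCulture);
            var cad = trkpt.SelectSingleNode(".//gpxtpx:cad", ns);
            if (cad != null) Output0Buffer.cad = (int)double.Parse(cad.InnerText, CultureInfo.InvariantCulture);
        }
    }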

Saving to a SQL Server Destination

Next create a new connection to the SQL Server database where the processed GPX data will be stored, then in the data flow create a new SQL Server Destination. Link the destination to the Script Source then open the properties of the newly created destination. Create a table to store the data, editing the default table name to something more memorable.

Now it’s time to test: run your SSIS package. It’ll probably fail, saying that you need to run the package as an administrator. Save everything, reopen SSDT as an admin then try again.

All green ticks!

And if everything’s worked you’ll see the data in the table in SQL Server as well. To download the source and sample GPX file head over to my github account or grab from here.
