# Bike Tour Business Intelligence: Part 4 [Populating a staging table]

A quick recap of the series so far: I’m building a cube using the data recorded by my Garmin Edge GPS during a four-month bike tour across Europe last year. So far we’ve analysed a single GPX file and used SSIS to import it into a SQL Server database, as well as using SQL Server’s spatial data functionality to reverse-geocode the points in the file.

This post describes how to import a series of GPX files, calculate some measures and output the data to a staging table that will be used later to build the fact table. The measures of interest are speed, distance and time; the GPX files themselves don’t contain this data, so we’ll be calculating it using a combination of the Derived Column and Script Transformations.

## Star Schema Design

Our final star schema will look like the picture below – three dimension tables (Time, Date and Location) linked to a Fact table containing our measures. The Fact table is linked to each Dimension twice, as the calculations relate to two points in space and time: one relationship for the first time/date/location and another for the second.

A design consideration is whether to keep date and time together or split them into two separate dimension tables. The grain of the data held in the Fact table is down to the second, so if date and time were stored in the same table the total number of rows would be (number of days) * 86400 seconds, whereas separating them requires only (number of days) + 86400 rows. Therefore it’s much more efficient space-wise to create two dimension tables.

## Populating the Staging Table

The `GPX_Staging` table will be populated by the SSIS workflow, which at a high level performs the following tasks:

```
foreach GPX file in folder {
    read GPX file into a small staging table, adding the lat/lon as a geography column
    reverse-geocode the geography column to provide a human-readable location
    determine neighbouring points, then calculate time, distance and speed
    insert rows into the staging table
}
```

And this is what the final workflow looks like. Let’s go through each component in turn.

## Create GPX Staging Table

This Execute SQL Task just drops and recreates the staging table. Nothing too important.
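As a rough sketch, the statement behind this task might look something like the following. The column list here is an assumption based on the measures calculated later in the flow, not the definitive schema.

```
-- Drop and recreate the staging table (column list is illustrative only)
if object_id('dbo.GPX_Staging') is not null
    drop table dbo.GPX_Staging;

create table dbo.GPX_Staging (
    first_loggedtime  datetime2      not null,
    next_loggedtime   datetime2      not null,
    first_LocationSK  int            null,
    next_LocationSK   int            null,
    total_distance    numeric(18, 5) null,  -- metres
    duration_s        int            null,  -- seconds
    speed_kmh         numeric(18, 5) null
);
```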

## Import GPX Files

A Foreach Loop Container, looping through all the GPX files in a directory (configurable via Project Parameters). The current file name is saved in a user variable called `GPXFileName`.

### Clear GPX Data

An Execute SQL Task to truncate the `GPXData` table, which acts as a staging table for the data from a single GPX file.

### Extract GPX Data

A Data Flow Task to pull out the GPX data from the XML file, add the

`location`

column then save it to the

`GPXData`

table. We covered this in Parts 1 to 3: the only difference is the GPX file name comes from the GPXFileName variable rather than being hard coded. ### Reverse Geocode

An Execute SQL Task which, for each row in the `GPXData` table, reverse-geocodes the location. Running the reverse-geocode process at this point is more efficient than doing it later in the data flow, as the location column is duplicated in the next stage.

Performing the lookup depends on the Location dimension table being populated, which I’ll cover in the next post. For now just assume the table exists and has the NUTS dataset loaded in.

The geocode SQL statement uses the `STIntersects` method of the geography data type, which we first came across in Part 3.

```
update GPXData
set LocationSK =
    (select DimLocation.LocationSK
     from DimLocation
     where GPXData.location.STIntersects(DimLocation.Shape) = 1)
```
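One performance note: running `STIntersects` against every polygon in DimLocation can be slow, and a spatial index usually helps considerably. A minimal sketch, assuming DimLocation has a clustered primary key (a requirement for spatial indexes); the index name is made up.

```
-- Spatial index on the Shape column to speed up STIntersects lookups
create spatial index SIdx_DimLocation_Shape
on DimLocation (Shape);
```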

### Calculate Speed and Distance

The first part of this Data Flow Task is to match sequential rows in the `GPXData` table – these become the start and end records in the speed/distance/time calculation. To achieve this, two OLE DB Sources are used with a Merge Join Transformation to bring them together. The two data sources use the `ROW_NUMBER()` function to produce rows that can then be matched.

```
select ROW_NUMBER() over (order by loggedtime asc) as row_num, *
from GPXData
order by loggedtime asc
```

```
select (ROW_NUMBER() over (order by loggedtime asc) - 1) as row_num, *
from GPXData
order by loggedtime asc
```

For the Merge Join to work, the data sources need to be sorted. The ORDER BY clause is present in the SQL statements, but there are two extra steps required to tell SSIS this is the case. First, bring up the “Advanced Properties” of one of the OLE DB Data Sources and set “IsSorted” to true under the Input and Output Properties tab (see screenshot below). Next, find the row_num column under the Output Columns list and set its SortKeyPosition property to 1. Rinse and repeat for the second OLE DB Source.

The Merge Join Transformation matches identical row_num values between the two data sets. The row_num column on the left hand side runs 1, 2, 3, 4, 5 and so on; the right runs 0, 1, 2, 3, 4, 5. In this way each row from the left is matched to the next one chronologically. The Merge Join is configured to output all columns from the left and right hand sides – the left with the prefix ‘first’ and the right with the prefix ‘next’.
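As an aside, on SQL Server 2012 or later the same neighbouring-row match could be done in a single query with the LEAD window function, rather than two sources and a Merge Join. A sketch for comparison only; sticking with the Merge Join keeps the logic in the data flow.

```
-- Pair each row with the chronologically next one in a single pass
select loggedtime                                  as first_loggedtime,
       LocationSK                                  as first_LocationSK,
       lead(loggedtime) over (order by loggedtime) as next_loggedtime,
       lead(LocationSK) over (order by loggedtime) as next_LocationSK
from GPXData
```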

The SQL geography class contains a method, `STDistance`, to calculate the distance between the two points in each row. Unfortunately SSIS lags far behind SQL Server here and has no native support whatsoever for the geography data type – remember in Part 1 we had to use DT_IMAGE for the geography columns in SSIS? This is rubbish, but at least the CLR is there to come to the rescue!

The Calculate Distance Between Two Points script component takes the first_location and next_location columns and outputs a single column, total_distance, of type `DT_NUMERIC`. The script itself is straightforward, converting from `BlobColumn` to `SqlBytes`, as this is the parameter type for the `SqlGeography.Deserialize()` method.

```
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Types;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    /// <summary>
    /// Deserialize the two geography objects (for first and next location)
    /// then use the STDistance method to calculate the distance between the two points.
    /// </summary>
    /// <param name="Row">The row that is currently passing through the component</param>
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (Row.firstlocation_IsNull || Row.nextlocation_IsNull)
        {
            Row.totaldistance_IsNull = true;
            return;
        }

        // Convert the firstlocation/nextlocation BlobColumn to type SqlBytes for deserialization.
        SqlBytes bytes = new SqlBytes(Row.firstlocation.GetBlobData(0, (int)Row.firstlocation.Length));
        SqlGeography start = SqlGeography.Deserialize(bytes);

        bytes = new SqlBytes(Row.nextlocation.GetBlobData(0, (int)Row.nextlocation.Length));
        SqlGeography end = SqlGeography.Deserialize(bytes);

        if (start != null && end != null)
        {
            // Round total distance to 5 decimal places
            Row.totaldistance = Convert.ToDecimal(Math.Round(start.STDistance(end).Value, 5));
        }
        else
        {
            Row.totaldistance_IsNull = true;
        }
    }
}
```
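The result is easy to sanity-check in T-SQL, where the geography type is supported natively. The coordinates below are purely illustrative; `STDistance` returns metres.

```
declare @start geography = geography::Point(51.5074, -0.1278, 4326);  -- London
declare @end   geography = geography::Point(48.8566,  2.3522, 4326);  -- Paris
select round(@start.STDistance(@end), 5) as total_distance_m;
```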

The remaining calculations are much more straightforward using the Derived Column Transformation. Here we can add the duration and speed (in km/h), as well as a couple of other columns that might be useful later on.
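For reference, the arithmetic behind those derived columns, written as T-SQL (column names here are assumptions, not the actual schema): total_distance is in metres and the duration in seconds, so speed in km/h is (metres / 1000) / (seconds / 3600).

```
-- Duration and speed from a pair of timestamps and a distance in metres
select total_distance,
       datediff(second, first_loggedtime, next_loggedtime) as duration_s,
       (total_distance / 1000.0)
         / (datediff(second, first_loggedtime, next_loggedtime) / 3600.0) as speed_kmh
from GPX_Staging
where datediff(second, first_loggedtime, next_loggedtime) > 0
```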

The final data flow component is the OLE DB Destination, where the rows are saved to the `GPX_Staging` table.

## Tying it all together

The workflow is now in place to populate the staging table with the raw data from the GPX file as well as computed columns. In the next post I’ll be creating the Dimension tables and populating the Fact table.
