At this stage the BI database contains the Date, Time and Location Dimension tables as well as a staging table with all the points captured by my GPS. This post covers creating and populating the Fact table, as well as running some queries over the dataset.
To recap, the star schema design has three Dimension tables all related to the Fact table, as shown below:
Creating the Fact Table
The Fact table creation flow involves two steps – the creation and population of the Fact table itself, followed by the creation of a clustered index (on all key columns) and the foreign key relationships to the dimension tables.
The SQL SELECT .. INTO query makes it easy to create and populate the Fact table:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
SELECT d1.DateSK AS StartDateSK, t1.TimeSK AS StartTimeSK, d2.DateSK AS EndDateSK, t2.TimeSK AS EndTimeSK, first_locationsk AS StartLocationSK, next_locationsk AS EndLocationSK, total_distance AS DistanceMetres, duration AS DurationSecs, speed_kmh AS SpeedKmh INTO FactGPX FROM GPX_Staging INNER JOIN DimDate d1 ON d1.Date = CONVERT(date,first_loggedtime) INNER JOIN DimDate d2 ON d2.Date = CONVERT(date,next_loggedtime) INNER JOIN DimTime t1 ON t1.Time = CONVERT(time, first_loggedtime) INNER JOIN DimTime t2 ON t2.Time = CONVERT(time, next_loggedtime) ORDER BY first_loggedtime ASC |
With some additional Execution SQL Tasks to prepare the database for the data load and to clear down staging tables, the final SSIS workflow is complete (click to view the full workflow):
Running it against all my GPX files results in a Fact table of just over 350,000 rows.
Some sample queries
With all the data loaded it’s hard to resist writing some queries…. which I did, and then promptly found some outlying data points. Cycling in the Czech Republic was not a highlight and we took a shortcut through Prague by train. As I still had my GPS recording, it captured the journey leaving a couple of 50km+ segments which skew the data. I added a filter to the Fact table generation query so that only points under 10km apart are imported, ensuring only data generated by cycling is available for analysis.
What days did we cycle the farthest?
1 2 3 4 5 6 7 |
SELECT d.StandardDate AS [Date], (SUM(f.DistanceMetres) / 1000) AS [TotalDistanceKm] FROM FactGPX f JOIN DimDate d ON d.DateSK=f.StartDateSK GROUP BY d.StandardDate ORDER BY (SUM(f.DistanceMetres) / 1000) DESC |
What’s the longest daily ride per country?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
;WITH CTE(Date, CountryName, TotalDistanceKm, Ranking) AS ( SELECT d.StandardDate AS [Date], l.CountryName, (SUM(DistanceMetres) / 1000) AS [TotalDistanceKm], RANK() OVER (PARTITION BY CountryName ORDER BY (SUM(DistanceMetres) / 1000) DESC) AS Ranking FROM FactGPX f JOIN DimDate d ON d.DateSK=f.StartDateSK JOIN DimLocation l ON l.LocationSK=f.StartLocationSK GROUP BY l.CountryName, StandardDate ) SELECT Date, CountryName, TotalDistanceKm FROM CTE WHERE Ranking = 1 ORDER BY TotalDistanceKm DESC |
Did we ride further in the morning morning or the afternoon?
1 2 3 4 5 6 7 |
SELECT t.AmPm AS [AmPm], (SUM(f.DistanceMetres) / 1000) AS [TotalDistanceKm] FROM FactGPX f JOIN DimTime t ON t.TimeSK=f.StartTimeSK GROUP BY t.AmPm ORDER BY (SUM(f.DistanceMetres) / 1000) DESC |
Which day of week did we have the highest moving average speed (assume moving is speed > 5kmh)?
1 2 3 4 5 6 7 8 |
SELECT d.DayOfWeek, (SUM(f.DistanceMetres) / SUM(f.DurationSecs) * 3.8) AS [AverageSpeedKmh] FROM FactGPX f JOIN DimDate d ON d.DateSK=f.StartDateSK WHERE f.SpeedKmh > 5 GROUP BY d.DayOfWeek ORDER BY (SUM(f.DistanceMetres) / SUM(f.DurationSecs) * 3.8) DESC |
Answer…. none in particular!
Summary so far
The DW database now contains all the points recorded by my GPS – next up I will be creating an Analysis Services Cube and performing some analytics over the dataset.
2 Responses to Bike Tour Business Intelligence: Part 6 [Building the Fact Table]