An Integration Services (SSIS) project was created in Visual Studio 2022 with the name ETL.Bellabeat.SSIS (SQL Server 2019).
Its purpose is to build automated data and control flows that orchestrate the complete ETL process (Extract – Transform – Load).
The core of the project contains 11 load packages, each dedicated to a single CSV source file. Each package includes a Flat File Source, the necessary type conversions, and an OLE DB Destination (fast load) writing into the corresponding typed RAW table.
A master package orchestrates the sequential execution of the 11 load packages in order to provide a complete, stable, and reproducible pipeline.
I implemented a typed RAW approach: the tables that receive the loaded data are created with their final data types and constraints (PK, NOT NULL), instead of using an intermediate staging layer in plain NVARCHAR. This strategy allows quality to be enforced as early as the CSV load.
For each package:

- a CREATE TABLE statement defines the destination table;
- a PRIMARY KEY constraint is declared;
- NOT NULL is applied to mandatory columns.

These tables form the typed RAW layer, used directly in the ETL.
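For illustration, here is a minimal sketch of what such a typed RAW table could look like. The column names are assumed from the public Fitbit daily-activity CSV and may not match the project's exact definitions:

```sql
-- Hypothetical typed RAW table for the daily activity CSV.
-- Column names, precision/scale, and the composite key are assumptions.
CREATE TABLE dbo.DailyActivity (
    Id            BIGINT        NOT NULL,  -- user identifier (too large for INT)
    ActivityDate  DATETIME      NOT NULL,  -- parsed with the en-US locale
    TotalSteps    INT           NOT NULL,
    TotalDistance DECIMAL(8, 2) NOT NULL,  -- precision/scale from observed CSV values
    Calories      INT           NOT NULL,
    CONSTRAINT PK_DailyActivity PRIMARY KEY (Id, ActivityDate)
);
```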
I followed a structured approach to determine SQL Server data types based on the actual granularity of the CSVs.
Id columns → BIGINT

I chose BIGINT for Id columns because:

- the Id values in the CSVs exceed the range of INT;
- BIGINT performs well in analytical joins.

Decimal columns (distance, METs, weight, BMI) → DECIMAL(p, s)
I chose the precision (p) and scale (s) by inspecting the real values in the CSVs (for example, the maximum number of digits actually observed before and after the decimal point).
Integer numeric columns (steps, minutes, calories, intensities…) → DT_I4 in SSIS, then INT in SQL Server.

Date/time columns from the CSVs → DT_DBTIMESTAMP in SSIS, then DATETIME (or DATETIME2 if required) in SQL Server.
These choices ensure fidelity to the source files, which is essential for the analytical project.
| Usage (Bellabeat) | SSIS type (Data Flow) | SQL Server type created in RAW tables |
|---|---|---|
| Identifiers Id, LogId | DT_I8 (64-bit integer) | BIGINT |
| Technical keys, counters, minutes | DT_I4 (32-bit integer) | INT |
| Distance, METs, weight, BMI, decimal values | DT_NUMERIC | DECIMAL(p, s) (p, s based on actual granularity) |
| Full date/time columns | DT_DBTIMESTAMP | DATETIME |
| Date-only columns | DT_DBDATE | DATE |
| Time-only columns (when used) | DT_DBTIME | TIME |
| Short text (e.g. categories, labels) | DT_STR | VARCHAR(n) |
| Unicode text | DT_WSTR | NVARCHAR(n) |
Each CSV file is handled in a dedicated package: 01_Load_*.dtsx.

Common structure of the Data Flow Tasks:

- Flat File Source;
- OLE DB Destination (fast load) → corresponding typed RAW table.

The MinuteSleep file contains duplicate rows.
In the package 01_Load_MinuteSleep.dtsx, a Sort transformation with duplicate removal eliminates rows sharing the same (Id, Date, LogId) key. This component ensures that only unique rows are sent to the RAW table. This is a critical step: without deduplication, the load would violate the primary key in SQL Server.
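As a sanity check, a query along these lines can confirm that no duplicate key combinations survived the load (table and column names are assumptions):

```sql
-- Hypothetical post-load check: the Sort + dedup step should leave no
-- duplicate (Id, Date, LogId) combinations in the RAW table.
SELECT Id, [Date], LogId, COUNT(*) AS row_count
FROM dbo.MinuteSleep
GROUP BY Id, [Date], LogId
HAVING COUNT(*) > 1;   -- expected: zero rows
```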
Because the Bellabeat CSV files are formatted using US conventions (en-US), I had to change the locale of the Flat File Connection Manager from my default French (fr-FR) environment to English (United States).
The French locale uses:

- commas as decimal separators (3,57),
- day-first dates (dd/MM/yyyy),
- semicolons as separators (;),

while the Bellabeat files use:

- dots as decimals (3.57),
- US dates (MM/dd/yyyy),
- commas as separators (,).
If I had kept the French locale, SSIS would not have parsed numbers or dates correctly, leading to Data Conversion Failed errors when converting to INT, DECIMAL, or DATETIME.
I also replaced the default ANSI 1252 code page with UTF-8, because the CSV files are encoded in UTF-8. Keeping ANSI could cause character corruption, parsing issues, or column misalignment.
Using en-US + UTF-8 ensures that SSIS reads the CSV structure correctly and loads clean, valid data into the RAW SQL layer.
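One quick way to confirm the locale was handled correctly is to inspect the loaded date range. A sketch, assuming a dbo.DailyActivity table and the familiar April–May 2016 window of the public Fitbit data:

```sql
-- Hypothetical sanity check for locale handling: with en-US settings,
-- the loaded dates should fall inside the expected study window.
-- Table and column names are assumptions.
SELECT MIN(ActivityDate) AS first_day,
       MAX(ActivityDate) AS last_day
FROM dbo.DailyActivity;
-- A day-first misparse would typically shift or widen this range,
-- or fail outright on dates whose day component exceeds 12.
```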
The following mechanisms were implemented to guarantee data quality during the load.
- SQL constraints created with the tables (NOT NULL, PRIMARY KEY);
- strict typing of every column (BIGINT / INT / DECIMAL / DATETIME);
- dates converted to DATETIME during the load;
- verification that Id is present in every file.

A master package 04_Master_Load_Bellabeat.dtsx orchestrates the sequential execution of the 11 packages.
This orchestration guarantees a reproducible execution of the whole ETL pipeline.
After loading, several checks were performed:

- comparison between the source CSV files and the loaded RAW tables;
- verification that the SQL Server types match the source formats exactly: BIGINT, DECIMAL(p, s), INT, DATETIME / DATETIME2;
- checks on Id in each table, and on the Id values shared across granularities;
- confirmation that the SQL constraints and primary keys guarantee uniqueness and the presence of mandatory values.
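As an illustration of such checks, here is a sketch using assumed table names (dbo.DailyActivity, dbo.HourlySteps):

```sql
-- Hypothetical post-load validation queries; table names are assumptions.
-- 1) Row count per RAW table, to compare with the CSV line counts:
SELECT COUNT(*) AS daily_rows FROM dbo.DailyActivity;

-- 2) Id values shared across granularities: users present in the daily
--    table but missing from the hourly table would show up here.
SELECT d.Id
FROM dbo.DailyActivity AS d
EXCEPT
SELECT h.Id
FROM dbo.HourlySteps AS h;   -- expected: empty result
```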
The data is now ready for exploratory data analysis (EDA), analytical SQL queries, aggregated views, or BI modelling.
The Bellabeat ETL relies on a typed RAW layer
generated via SSIS OLE DB Destinations, strengthened with optimised data
types (BIGINT, DECIMAL) and strict SQL
constraints. The 11 load packages structure the CSV data into a stable,
controlled schema, including duplicate removal for MinuteSleep. The
master package orchestrates the entire chain, providing a reproducible
and reliable pipeline that meets the requirements of a structured
analytics project.
┌──────────────────────────────┐
│ Project ETL.Bellabeat.SSIS │
│ Visual Studio 2022 │
└───────────────┬──────────────┘
│
▼
┌──────────────────────────────┐
│ Master Package (04_Master) │
│ Orchestration of 11 packages │
└───────────────┬──────────────┘
│
▼
┌──────────────────────────────┐
│ Packages 01_Load_* │
│ (Daily, Hourly, Minute, │
│ WeightLogInfo) │
└───────────────┬──────────────┘
│
▼
┌──────────────────────────────┐
│ Flat File Source │
│ Read CSV │
└───────────────┬──────────────┘
│
▼
┌──────────────────────────────┐
│ SSIS Transformations │
│ - Type conversion │
│ - Derived Column │
│ - Cleaning │
│ MinuteSleep: Sort + Dedup │
└───────────────┬──────────────┘
│
▼
┌──────────────────────────────┐
│ OLE DB Destination │
│ Table creation + typing │
│ PK / NOT NULL / DECIMAL / │
│ BIGINT / DATETIME │
└───────────────┬──────────────┘
│
▼
┌──────────────────────────────┐
│ Typed RAW SQL layer │
│ bellabeat_data (11 tables) │
└───────────────┬──────────────┘
│
▼
┌──────────────────────────────┐
│ Data ready for EDA, SQL │
│ and BI modelling │
└──────────────────────────────┘
To manage the database schema in a structured and version-controlled
way, I added a SQL Server Database Project (SSDT) to
the Visual Studio 2022 solution.
The project was initialized with SQL Server 2019 as its target platform.
SSDT then automatically generates a .dacpac file during
the build process. This file represents a complete and consistent
version of the SQL schema. The project can be versioned in Git, ensuring
accurate tracking of changes and enabling the database to be rebuilt
identically across any environment.
To ensure reliable, traceable, and reproducible management of the database used by the Bellabeat ETL, I chose to use an SSDT (SQL Server Data Tools) project rather than creating tables manually in SQL Server Management Studio. With SSDT, the entire data model—tables, columns, data types, keys, constraints, and indexes—is defined directly in Visual Studio and versioned in Git.
The project is published through a DACPAC, a deployment artifact that automates the creation or update of the bellabeat_data database. This approach eliminates manual discrepancies between environments and ensures an infrastructure that can be rebuilt identically, which is essential in an ETL workflow.
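As a sketch of how such a deployment could be verified, the following catalog query lists the deployed tables and their column types; it is pure metadata inspection and assumes nothing project-specific:

```sql
-- Hypothetical smoke test after publishing the DACPAC to bellabeat_data:
-- confirm the deployed schema matches what SSDT defines.
SELECT t.name  AS table_name,
       c.name  AS column_name,
       ty.name AS data_type,
       c.max_length,
       c.is_nullable
FROM sys.tables  AS t
JOIN sys.columns AS c  ON c.object_id = t.object_id
JOIN sys.types   AS ty ON ty.user_type_id = c.user_type_id
ORDER BY t.name, c.column_id;
```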
A DACPAC (Data-tier Application Component Package) contains only the schema of a SQL Server database, which in this case corresponds to a typed RAW layer, including:

- tables and columns,
- data types,
- primary keys,
- views.
It does not contain any data—only the structure. The DACPAC represents a consistent snapshot of the schema as defined in SSDT.
Once the bellabeat_data database is deployed via the DACPAC, the 11 SSIS packages—coordinated by a master package—load the CSV files into the tables defined by SSDT.
This schema-first approach ensures that column typing is controlled, dates are standardized, and primary keys are enforced from the start. SSIS therefore loads the data into a clean, consistent, and analytics-ready environment.