Note: This is a case study for the completion of my Google Data Analytics certificate. All data used are real and can be downloaded from Kaggle provided by Mobius. To view the source code to produce this work, please visit my repository on GitHub. This whole case study takes 15-20 minutes to read thoroughly.
Here is the link to the Executive Summary.
Founded in 2014 by Urška Sršen and Sandro Mur, Bellabeat is a high-tech wellness company that manufactures products specifically designed and engineered to improve the health of women. Bellabeat offers various products that can monitor the productivity, activity, and reproductive health of women. While Bellabeat is just a small company, it is successful and is capable of expanding its business in the global smart device market. To do this, Urška Sršen has urged the data analytics team to analyze the fitness data of smart devices so that the company can open up opportunities for growth. In this scenario, I am acting as part of the data analytics team for Bellabeat, tasked to focus on one of Bellabeat’s products as well as gain insights from the analysis I discover. I will ultimately provide my recommendations to the executives to help drive the marketing strategy for the company.
As this is a case study for my Google Data Analytics certificate, I will be incorporating the following phases of the data life cycle: Ask, Prepare, Process, Analyze, Share, and Act.
Business Task: Analyze general customer usage of fitness tracking devices to uncover trends and generate insights that can be applied to Bellabeat’s Time watch and improve its overall sales.
The following are the key stakeholders:
Data Source: FitBit Fitness Tracker Data, licensed under CC0 public domain, was downloaded from Kaggle through Mobius. The extracted file contained 18 csv files, 11 of which I used to cross reference data to each other. However, in the end, I will ultimately be using only 8 of the files for my analysis due to duplicate data. Below are the names of the files with a brief description:
| Original File Name | Unique Id’s | Description | 
| dailyActivity_merged | 33 | Consolidated version of data from various files (but not all) spanning 31 days | 
| dailyCalories_merged | 33 | Calories per day over 31 days | 
| dailyIntensities_merged | 33 | Intensities (ex: lightly active, fairly active, etc.) over 31 days | 
| dailySteps_merged | 33 | Steps per day over 31 days | 
| heartrate_seconds_merged | 14 | Heart rate per second over 31 days | 
| hourlyCalories_merged | 33 | Calories burned per hour over | 
| hourlyIntensities_merged | 33 | Total intensity and average intensity per hour over 31 days | 
| hourlySteps_merged | 33 | Steps per hour over 31 days | 
| sleepDay_merged | 24 | Total minutes asleep, total time in bed and how many times user went to bed over 31 days | 
| minuteSleep_merged | 24 | Sleep quality (?) per minute and logid (?) over 31 days | 
| weightLogInfo_merged | 8 | Daily weight in lbs, kg, and BMI over 31 days | 
Before moving on to cleaning the data, I must first address key concerns regarding the dataset, such as its privacy and security. Since the data is in the public domain under the CC0 license, I am permitted to use it. It is provided by Mobius and can be downloaded from Kaggle.
This data was collected between March 12, 2016 and May 12, 2016 by Amazon Mechanical Turk, a crowdsourcing marketplace that is usually very reliable. However, there were several issues regarding the bias, credibility and integrity of this dataset:
The descriptions claims that thirty eligible FitBit users consented to the submission of personal tracker data.” However, a quick preliminary data check revealed that there were actually thirty-three unique ID’s collected in the data. Was this simply an error in the description with rounding of number? Or did the extra ID’s collected come from individuals who did not consent to the submission of personal tracker data?
The dataset is not only outdated, but it also supposedly contained data only between April 12, 2016 and May 12, 2016. This is in contrast to the description that claimed the data contains data between March 12, 2016 and May 12, 2016.
A sample size of 33 participants is relatively low compared to the billions of women in the world. Other files contained as little as 8 different users with 67 records.This could result in bias and distort the result of the analysis.
It wasn’t stated in the dataset description how the data was collected. Could the participants all have resided in a specific area or country? Were they all from a particular group, such as a fitness class? The sample population may not be representative of the whole female population who would benefit from Bellabeat’s products.
The fields and values are ambiguous; nothing in the description of the dataset clarifies them. The following are questions I would ask from the appropriate stakeholder or source regarding the following files:
▸ ‘dailyActivity_merged’: what is the difference between the fields “TotalDistance,’ “TrackerDistance,” and “LoggedActivitiesDistance” Also, are the values measured in kilometers or miles? The same goes for columns ‘SedentaryActiveDistance’ and etc.
▸ ‘hourlyIntensities_merged’: what do the values in the columns ‘TotalIntensity’ and ‘AverageIntensity’ represent?
▸ ‘minuteSleep_merged’: what do the values in the column ‘Value’ represenent?
▸ ‘sleepDay_merged’: The column ‘Time’ only shows strictly ’12:00:00’ and ‘AM.’ Does this mean that the column ‘TotalMinutesAsleep’ doesn’t include the current sleep session from the time “before midnight”? This can provide misleading data since if one were to sleep tonight from 10PM – 6AM and didn’t sleep the next night, the data may only record 6 hours of sleep (from 12AM-6AM) for tomorrow instead of 8 hours of sleep for today and 0 hours tomorrow.
In addition, I am only using one dataset to conduct my analysis. Perhaps I could obtain more data elsewhere then aggregate the results but for the purpose of this case study, I will continue my analysis with the data I already have.
Below are samples codes ran in R to identify the FitBit dataset’s credibility and integrity:
The following code just loads the ‘tidyverse’ package within R to provide various tools that help me conduct my analysis. Note the number of rows (940) for this specific data. This means there are 940 records in total.
library(tidyverse)## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errorssetwd('C:\\Users\\dalea\\Documents\\Coursera\\Google Data Analytics\\FitBit Wellness Tracker Dataset\\Fitabase Data 4.12.16-5.12.16\\No longer needed')
# Create first raw table for pre-processing
FitBit_data_raw <- read_csv('dailyActivity_merged.csv')## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.The glimpse() function shows that there are 33 rows. This means that there are 33 unique IDs for this data.
# Create table
daily_activity_unique_id <- unique(select(FitBit_data_raw, Id)) 
# Check data
glimpse(daily_activity_unique_id)## Rows: 33
## Columns: 1
## $ Id <dbl> 1503960366, 1624580081, 1644430081, 1844505072, 1927972279, 2022484…The following code checks the amount of records for each particular date. Note the number of rows which represent the different dates as well as the data type ‘date’. The count represents the amount of records for the specific date.
# Group data
date_count <- FitBit_data_raw %>% 
  group_by(ActivityDate) %>% 
  summarize(count = n())
# Convert to date format  
date_count$ActivityDate <- as.Date(date_count$ActivityDate, format = '%m/%d/%Y')
# Check data to make sure data format is as intended 
glimpse(date_count) ## Rows: 31
## Columns: 2
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-15, 2016-04-…
## $ count        <int> 33, 33, 33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 3…Let’s visualize the amount of records by date into a chart.
# Plot graph
ggplot(date_count, aes(x = ActivityDate, y = count)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +  
  labs(x = "Date", y = "Count", title = "Number of Counts by Date") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
        plot.title = element_text(size = 15, hjust = 0.5))  As you can see, less and less data were being recorded towards the end of the data collection (2016-05-12). It is best to filter out the last day (2016-05-12). This is because the low records can skew results tremendously and provide misleading data. I think the users just lost motivation to record their progress towards the end of the data collection.
Before I can analyze the dataset, I had to make sure the data is clean. Below are the following actions I took to clean the data:
I opened every .csv file in excel then filtered all the columns and checked for the presence of “Blanks.” There were no blanks in all of the dataset except for the column ‘Fat’ in the file ‘weightLogInfo_merged.’ Since this specific file had only 68 max records and the ‘Fat’ column only had 2 records, I will skip this field from analysis.
While I can perform data cleaning in R as well, I will showcase my ability to process large files in SQL. In this case, we will clean the file ‘heartrate_seconds_merged’ in BigQuery since it has 2.48 millions of rows of data. Otherwise, using Excel will result in data loss.
After importing the ‘heartrate_seconds_merged.csv’ file into BigQuery, I performed the following steps to clean then download the data:
1) Since using the ‘auto detect’ feature for the schema resulted in job errors, I had to manually enter the schema in JSON format.
2) Created a schema using the following JSON format
3) Used CAST() function to “permanently” transform the data into the correct data type (for some reason, saving the data as .csv then verifying the data types in R showed “string” format for all columns)
4) File size was too large to save the whole data locally (2,483,658 rows)
5) Exported the data as a table in BigQuery to save the whole data
6) I then exported the table to Google Cloud Storage so I could download the whole data
7) Data was split into multiple .csv files so I downloaded all of them for analysis in R
8) Combined each .csv file into one large .csv file using R
9) Double check that the new file ‘heartrate_seconds_merged_cleaned.csv’ contains the original 2,483,658 rows but with the correct format and data types this time
The output of the code below reveals that the data from ‘dailyIntensities_merged’ is mostly the same as a portion of the data from ‘dailyActivity_merged.’
setwd('C:/Users/dalea/Documents/Coursera/Google Data Analytics/FitBit Wellness Tracker Dataset/Fitabase Data 4.12.16-5.12.16/No longer needed')
# Create intensities table
FitBit_intensities <- read_csv('dailyIntensities_merged.csv')## Rows: 940 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (9): Id, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, Ve...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.# Create intensities table from raw table (see 2.4.1)
intensities_from_raw <- select(FitBit_data_raw, Id, ActivityDate, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, SedentaryActiveDistance, LightActiveDistance, ModeratelyActiveDistance, VeryActiveDistance)
# Create duplicate of intensities table but following specific order of columns 
intensities_from_intensities <- select(FitBit_intensities, Id, ActivityDay, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, SedentaryActiveDistance, LightActiveDistance, ModeratelyActiveDistance, VeryActiveDistance)
# Compare the two previous tables to check for exact match
result_intensities <- unique(intensities_from_raw == intensities_from_intensities)
result_intensities##         Id ActivityDate SedentaryMinutes LightlyActiveMinutes
##  [1,] TRUE         TRUE             TRUE                 TRUE
##  [2,] TRUE         TRUE             TRUE                 TRUE
##  [3,] TRUE         TRUE             TRUE                 TRUE
##  [4,] TRUE         TRUE             TRUE                 TRUE
##  [5,] TRUE         TRUE             TRUE                 TRUE
##  [6,] TRUE         TRUE             TRUE                 TRUE
##  [7,] TRUE         TRUE             TRUE                 TRUE
##  [8,] TRUE         TRUE             TRUE                 TRUE
##  [9,] TRUE         TRUE             TRUE                 TRUE
## [10,] TRUE         TRUE             TRUE                 TRUE
## [11,] TRUE         TRUE             TRUE                 TRUE
## [12,] TRUE         TRUE             TRUE                 TRUE
## [13,] TRUE         TRUE             TRUE                 TRUE
## [14,] TRUE         TRUE             TRUE                 TRUE
##       FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
##  [1,]                TRUE              TRUE                    TRUE
##  [2,]                TRUE              TRUE                    TRUE
##  [3,]                TRUE              TRUE                    TRUE
##  [4,]                TRUE              TRUE                    TRUE
##  [5,]                TRUE              TRUE                    TRUE
##  [6,]                TRUE              TRUE                   FALSE
##  [7,]                TRUE              TRUE                   FALSE
##  [8,]                TRUE              TRUE                    TRUE
##  [9,]                TRUE              TRUE                    TRUE
## [10,]                TRUE              TRUE                   FALSE
## [11,]                TRUE              TRUE                   FALSE
## [12,]                TRUE              TRUE                    TRUE
## [13,]                TRUE              TRUE                   FALSE
## [14,]                TRUE              TRUE                   FALSE
##       LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
##  [1,]               FALSE                    FALSE              FALSE
##  [2,]               FALSE                    FALSE               TRUE
##  [3,]               FALSE                     TRUE              FALSE
##  [4,]                TRUE                     TRUE               TRUE
##  [5,]               FALSE                     TRUE               TRUE
##  [6,]               FALSE                     TRUE               TRUE
##  [7,]               FALSE                    FALSE              FALSE
##  [8,]                TRUE                    FALSE               TRUE
##  [9,]                TRUE                    FALSE              FALSE
## [10,]               FALSE                    FALSE               TRUE
## [11,]               FALSE                     TRUE              FALSE
## [12,]                TRUE                     TRUE              FALSE
## [13,]                TRUE                     TRUE              FALSE
## [14,]                TRUE                    FALSE              FALSELet’s investigate why some fields are not an exact match and are showing up FALSE on the comparison test.
I’ll start with a logical comparison test from the common column ‘SedentaryActiveDistance.’
# Extract columns from both tables and insert to two new tables
activities_from_raw <- select(FitBit_data_raw, SedentaryActiveDistance)
activities_from_intensities <- select(FitBit_intensities, SedentaryActiveDistance)
# Check if FALSE value is present
result_activity <- unique(activities_from_raw == activities_from_intensities) #Compares values
result_activity##      SedentaryActiveDistance
## [1,]                    TRUE
## [2,]                   FALSETo figure out why the logical value from certain rows result in FALSE, I had to locate which rows resulted in the FALSE logical value. I modified the previous code to the following:
# Convert to data frame and add new column with row numbers to locate problematic rows
result_activity <- data.frame(activities_from_raw == activities_from_intensities) %>% 
  mutate(row_number = row_number()) 
head(result_activity)##   SedentaryActiveDistance row_number
## 1                    TRUE          1
## 2                    TRUE          2
## 3                    TRUE          3
## 4                    TRUE          4
## 5                    TRUE          5
## 6                    TRUE          6# Filter for fields that return FALSE to determine row numbers
false_only <- result_activity %>% 
  filter(SedentaryActiveDistance == FALSE)
head(false_only)##   SedentaryActiveDistance row_number
## 1                   FALSE         34
## 2                   FALSE         37
## 3                   FALSE         38
## 4                   FALSE         41
## 5                   FALSE         43
## 6                   FALSE         45I located these rows in excel and noticed that the rows that contained FALSE values all contained decimals (Ex: Cell J35). See the image below:
I realized that the issue was likely due to floating-point precision differences between the values stored in the Excel files and how R represents those values internally. To fix this, I can simply use the near() function to provide the level of tolerance for decimals.
# Create tolerance up to 2 digits (least number of decimals)
tolerance <- 1e-2
# Sample calculation
c <- 0.00999999977648258
d <- 0.01
# Output is 'TRUE' because value of c rounds up to 0.01 since tolerance is 2 digits
e <- near(d,c, tolerance)
e## [1] TRUE# Compare the values again with tolerance of 2 digits
check_logical <- near(intensities_from_raw$SedentaryActiveDistance, intensities_from_intensities$SedentaryActiveDistance, tolerance)
check_logical##   [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [106] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [136] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [151] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [166] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [196] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [211] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [226] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [241] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [256] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [271] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [286] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [316] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [331] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [346] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [361] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [376] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [391] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [406] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [421] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [436] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [451] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [466] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [481] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [496] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [511] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [526] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [541] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [556] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [571] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [586] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [601] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [616] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [631] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [646] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [661] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [676] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [691] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [706] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [721] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [736] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [751] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [766] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [781] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [796] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [811] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [826] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [841] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [856] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [871] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [886] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [901] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [916] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [931] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUEAlthough we’ve determined the cause of the discrepancies, we don’t actually need the data for the distance columns like ‘SedentaryActiveDistance.’ As mentioned in section 2.3, it is unclear what the values in these columns represent. We will not be analyzing these columns for this case study.
# Create table
FitBit_calories <- read_csv('dailyCalories_merged.csv')
# Create different tables to compare the two 
calories_from_raw <- select(FitBit_data_raw, Calories)
calories_from_calories <- select(FitBit_calories, Calories)
# Compare the tables to check for exact match
result_calories <- unique(calories_from_raw == calories_from_calories)
result_caloriesThe result of the code shows that there is only one unique value (‘TRUE’) with the logical comparison test for the calories column. This means the two columns from the two tables are an exact match.
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.##      Calories
## [1,]     TRUEFitBit_steps <- read_csv('dailySteps_merged.csv')
steps_from_raw <- select(FitBit_data_raw, TotalSteps)
steps_from_steps <- select(FitBit_steps, StepTotal)
result_steps <- unique(steps_from_raw == steps_from_steps)
result_stepsThe result of the code shows that there is only one unique value (‘TRUE’) with the logical comparison test for the steps column. This means the two columns from the two tables are an exact match.
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.##      TotalSteps
## [1,]       TRUEI used the ‘Remove Duplicates’ Tool from Excel to remove duplicate entries for the following two files:
I transformed and converted the data type ‘ActivityDate’ column from ‘dailyActivity_merged’ file to date format then created new file ‘dailyActivity_merged_cleaned’
Steps:
saved the original ‘dailyActivity_merged’ .csv file as another .csv file with different name ‘dailyActivity_merged_cleaned.’ This will now be the file that I’ll be using for analysis
split the values in the ActivityDate field using “/” as the delimiter to separate the month, day and year into separate columns using the text-to-column tool
created a new column ‘T’ that combined the values from the new columns in a date format
replaced the old ActivityDate column with the values from column ‘T’ then renamed new column name to ‘Date’
deleted the ‘transition columns’ (columns Q, R, S and T) as they are no longer needed
I split the values in column ‘SleepDay’ from ‘sleepDay_merged’ according to day and time then transformed the data type to date and time
Steps:
created a duplicate file then name it to ‘sleepDay_merged_cleaned.’ This is now the file I’ll be using for analysis
split the ‘SleepDay’ column into three new columns ‘Date’, ‘Time’ and ‘AM_or_PM’ using the Text to Columns tool then transformed the values into the correct day/time format
I split the values in ‘date’ column from ‘minuteSleep_merged’ file into three columns with the correct data types then capitalized column names for consistency
Steps:
duplicated file then named it ‘minuteSleep_merged_cleaned.’ This is the file I’ll be using now for analysis
split the values in ‘Date’ column into three columns ‘Date’, ‘Time’, and ‘AM_or_PM’ then changed data type accordingly
modified the column names by capitalizing the first letter to be consistent with other columns from other files
The image below shows the final cleaned version:
I split the values in ‘Date’ column from ‘weightLogInfo_merged’ file into three columns with the correct data types
Steps:
created duplicate file with filename ‘weightLogInfo_merged_cleaned.’ This will now be the new file I’ll be using to conduct my analysis
split the values in ‘Date’ column into three columns ‘Date’, ‘Time’, and ‘AM_or_PM’ then changed data type accordingly
note: this is the only file that contains blanks or missing values. I will not be analyzing the column ‘Fat’
The image below is the final cleaned version of the file:
I splitting the date values into ‘Date’, ‘Time’, and ‘AM_or_PM’ for ‘hourlyIntensities_merged’ and ‘hourlyCalories_merged’
I repeated the cleaning process of splitting the date values into ‘date’, ‘time’ and ‘AM_or_PM’ columns then transformed the data type accordingly for the following files:
Steps:
I created a duplicate file then renamed it as the originally file but adding ‘_cleaned’ to the end of it
performed the same data transformation steps as in the previous examples then verified in R that the columns are of the correct data type
According to my analysis, I propose three marketing strategy recommendations for Bellabeat’s Time watch, each substantiated by data and accompanied by actionable steps.
| Data | FitBit users typically start going to bed by 8:00 pm, with most users waking up starting at 05:30 am. Additionally, the average sleep duration is approximately 7 hours per day, with users getting the most sleep on Sundays. | 
| Actionable Steps | 
 
 | 
| Limitations | Potential users may be turned off by the high price of Time watches since the device may come with other features they are not particularly interested in | 
| Data | FitBit user data shows consistent usage of the device throughout the week, with peaks in calorie burn, intensity, and heart rate occurring around 12:00 pm and 7:00 pm. Steps taken and calories burned tend to be the lowest on Sundays and the highest on Saturdays, indicating potential opportunities to encourage wellness habits strategically. | 
| Actionable Steps | 
 | 
| Limitations | 
 | 
| Data | Those who consistently wore their FitBit device even during sleep demonstrated a sustained increase in physical activity over time compared to those who didn’t record their sleep. | 
| Actionable Steps | 
 | 
| Limitations | From the subgroup of FitBit users mentioned in section 4.8.2: Their increase in light physical activity over time may not actually be due to them wearing the FitBit device, but rather due to their inherent motivation to be healthier. This motivation may have simply just affected their choice to record every aspect of their journey, including their sleep patterns (correlation, not causation). | 
By incorporating these recommendations into its marketing strategy, Bellabeat can effectively leverage the insights from FitBit user data to position the Time watch as a comprehensive solution for tracking and optimizing users’ health, fitness, and sleep routines throughout the week.