In this case study, a junior data analyst works in the marketing department of a bike-share company. The director believes the company's success depends on maximizing the number of annual memberships. The assignment for the analytics team is to analyze the available dataset to answer the following questions:
Let's say I got the following assignments:
In the following sections, I describe every step executed to achieve the objective of this case study.
For this analysis, the following libraries are used:
library(tidyverse)
library(lubridate) # To transform datetime
library(geosphere) # To calculate distances from longitude & latitude coordinates
For this project, I use the bike-sharing dataset for the year 2022. The dataset can be downloaded from here. It is made available by Lyft Bikes and Scooters under this license agreement.
After storing the data locally, I checked the credibility of the data and whether it is ROCC (reliable, original, comprehensive, current and cited).
path_trip_2022_01 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202201-divvy-tripdata.csv"
path_trip_2022_02 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202202-divvy-tripdata.csv"
path_trip_2022_03 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202203-divvy-tripdata.csv"
path_trip_2022_04 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202204-divvy-tripdata.csv"
path_trip_2022_05 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202205-divvy-tripdata.csv"
path_trip_2022_06 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202206-divvy-tripdata.csv"
path_trip_2022_07 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202207-divvy-tripdata.csv"
path_trip_2022_08 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202208-divvy-tripdata.csv"
path_trip_2022_09 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202209-divvy-publictripdata.csv"
path_trip_2022_10 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202210-divvy-tripdata.csv"
path_trip_2022_11 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202211-divvy-tripdata.csv"
path_trip_2022_12 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202212-divvy-tripdata.csv"
trip_2022_01 <- read_csv(path_trip_2022_01)
trip_2022_02 <- read_csv(path_trip_2022_02)
trip_2022_03 <- read_csv(path_trip_2022_03)
trip_2022_04 <- read_csv(path_trip_2022_04)
trip_2022_05 <- read_csv(path_trip_2022_05)
trip_2022_06 <- read_csv(path_trip_2022_06)
trip_2022_07 <- read_csv(path_trip_2022_07)
trip_2022_08 <- read_csv(path_trip_2022_08)
trip_2022_09 <- read_csv(path_trip_2022_09)
trip_2022_10 <- read_csv(path_trip_2022_10)
trip_2022_11 <- read_csv(path_trip_2022_11)
trip_2022_12 <- read_csv(path_trip_2022_12)
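As a more compact alternative, the twelve files can also be read in one pass. This is a minimal sketch assuming the same folder layout as above; note that the September file deviates from the common naming pattern and is patched manually:

base_dir <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022"
months <- sprintf("2022%02d", 1:12)
paths <- file.path(base_dir, paste0(months, "-divvy-tripdata.csv"))
paths[9] <- file.path(base_dir, "202209-divvy-publictripdata.csv") # September deviates
trip_list <- lapply(paths, read_csv) # list of 12 monthly data frames

bind_rows(trip_list) would then combine all twelve months at once, equivalent to the bind_rows() call further below.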
# For testing purposes:
path_trip_2014_07 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2014/2014_Q3Q4/Divvy_Trips_2014-Q3-07.csv"
trip_2014_07 <- read_csv(path_trip_2014_07)
Check if the colnames for every month are consistent. This check is necessary before combining the data from each month into one single data frame. If the output is TRUE, all months have identical colnames. If the output is TRUE FALSE, the colnames are not identical.
For testing purposes, let's compare against a wrong dataset. The output should be TRUE FALSE.
wrong_monthly_colname_list <- c(colnames(trip_2022_01), colnames(trip_2014_07))
unique(unique(wrong_monthly_colname_list) == colnames(trip_2022_01))
## [1] TRUE FALSE
Check whether the monthly data for 2022 have the same colnames:
monthly_colname_list <- c(colnames(trip_2022_01), colnames(trip_2022_02), colnames(trip_2022_03),
colnames(trip_2022_04), colnames(trip_2022_05), colnames(trip_2022_06),
colnames(trip_2022_07), colnames(trip_2022_08), colnames(trip_2022_09),
colnames(trip_2022_10), colnames(trip_2022_11), colnames(trip_2022_12))
unique(unique(monthly_colname_list) == colnames(trip_2022_01))
## [1] TRUE
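An alternative check (a sketch) compares each month's colnames directly against January with identical(), which does not depend on the order in which unique() returns its values:

col_ref <- colnames(trip_2022_01)
all(vapply(list(trip_2022_02, trip_2022_03, trip_2022_04, trip_2022_05,
                trip_2022_06, trip_2022_07, trip_2022_08, trip_2022_09,
                trip_2022_10, trip_2022_11, trip_2022_12),
           function(df) identical(colnames(df), col_ref),
           logical(1))) # TRUE only if every month matches January exactly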
trip_2022 <- bind_rows(trip_2022_01, trip_2022_02, trip_2022_03,
trip_2022_04, trip_2022_05, trip_2022_06,
trip_2022_07, trip_2022_08, trip_2022_09,
trip_2022_10, trip_2022_11, trip_2022_12)
Get dimension for 2022 data:
dim(trip_2022)
## [1] 5667717 13
Show colnames:
colnames(trip_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
Total number of rows containing missing values:
na_df <- trip_2022[rowSums(is.na(trip_2022)) > 0,]
nrow(na_df)
## [1] 1298357
Check which columns have missing values and how many:
colSums(is.na(na_df))[colSums(is.na(na_df)) > 0]
## start_station_name start_station_id end_station_name end_station_id
## 833064 833064 892742 892742
## end_lat end_lng
## 5858 5858
Given the findings above, the most reliable way to identify a specific station is via the coordinates: start_lat and start_lng for the start station (no missing values) and end_lat and end_lng for the end station (5858 missing values), rather than via station_id, because 833064 rows are missing start_station_id while only 5858 are missing end_lat and end_lng.
Checking whether the missing values for the columns end_station_id, end_station_name, end_lat and end_lng always show up together:
na_df[is.na(na_df$end_lat), ] %>%
select(end_station_id, end_station_name, end_lat, end_lng) %>%
unique()
## # A tibble: 1 × 4
## end_station_id end_station_name end_lat end_lng
## <chr> <chr> <dbl> <dbl>
## 1 <NA> <NA> NA NA
As can be seen, all missing values that could identify the end station occur together. Hence, it is not possible to recover the 5858 missing pieces of end-station information from this dataset.
Let's go one step further and check whether these missing values occur on specific dates:
na_df[is.na(na_df$end_lat), ]$ended_at %>%
date() %>%
unique() %>%
length()
## [1] 363
It seems the missing end-station values occur on almost every day of the year (363 of 365 days).
Calculate the percentage of rows with missing end-station information:
(nrow(na_df[is.na(na_df$end_lat), ])/nrow(trip_2022))*100
## [1] 0.1033573
Only a very small portion of the data (about 0.1%) is missing end-station information. Hence, these rows are removed in the next step.
The total row count after removal should be: 5667717 - 5858 = 5661859.
cleaned_data <- trip_2022[!is.na(trip_2022$end_lat),]
nrow(cleaned_data)
## [1] 5661859
Remove the columns start_station_name, start_station_id, end_station_name and end_station_id:
cleaned_data <- cleaned_data %>%
select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
colnames(cleaned_data)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual"
colSums(is.na(cleaned_data))
## ride_id rideable_type started_at ended_at start_lat
## 0 0 0 0 0
## start_lng end_lat end_lng member_casual
## 0 0 0 0
Rename the following columns: rideable_type -> bike_type and member_casual -> user_type. Rename the following value in the column user_type: casual -> non-member. Renaming the columns and values is necessary for clarity.
trip_2022 <- trip_2022 %>%
rename(bike_type = rideable_type,
user_type = member_casual)
trip_2022 <- trip_2022 %>%
mutate(user_type=recode(user_type,
"casual"="non-member"))
Create the column duration by calculating the time difference:
duration <- cleaned_data$ended_at - cleaned_data$started_at
head(duration)
## Time differences in secs
## [1] 177 261 261 896 362 202
The duration is still in seconds. Convert it to minutes and round to two decimal places:
duration <- round(make_difftime(duration, units="minute"), 2)
head(duration)
## Time differences in mins
## [1] 2.95 4.35 4.35 14.93 6.03 3.37
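The same result can be obtained in one step with difftime(); a sketch (note that as.numeric() drops the difftime class, yielding a plain numeric vector):

duration_alt <- round(as.numeric(difftime(cleaned_data$ended_at,
                                          cleaned_data$started_at,
                                          units = "mins")), 2)
head(duration_alt) # same values as above, but numeric instead of difftime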
Insert duration into cleaned_data:
cleaned_data <- mutate(cleaned_data, duration)
head(select(cleaned_data, c(started_at, ended_at, duration)))
## # A tibble: 6 × 3
## started_at ended_at duration
## <dttm> <dttm> <drtn>
## 1 2022-01-13 11:59:47 2022-01-13 12:02:44 2.95 mins
## 2 2022-01-10 08:41:56 2022-01-10 08:46:17 4.35 mins
## 3 2022-01-25 04:53:40 2022-01-25 04:58:01 4.35 mins
## 4 2022-01-04 00:18:04 2022-01-04 00:33:00 14.93 mins
## 5 2022-01-20 01:31:10 2022-01-20 01:37:12 6.03 mins
## 6 2022-01-11 18:48:09 2022-01-11 18:51:31 3.37 mins
Check the summary statistics of duration:
min(cleaned_data$duration); mean(cleaned_data$duration); median(cleaned_data$duration); max(cleaned_data$duration)
## Time difference of -10353.35 mins
## Time difference of 16.32754 mins
## Time difference of 10.27 mins
## Time difference of 34354.07 mins
It seems that there are some faulty values, because the minimum duration is negative.
Count the rows with a faulty (negative) duration:
nrow(cleaned_data[cleaned_data$duration < 0, ])
## [1] 100
Display the head of the faulty duration rows:
head(select(cleaned_data[cleaned_data$duration < 0, ], c(started_at, ended_at, duration)))
## # A tibble: 6 × 3
## started_at ended_at duration
## <dttm> <dttm> <drtn>
## 1 2022-03-05 11:00:57 2022-03-05 10:55:01 -5.93 mins
## 2 2022-03-05 11:38:04 2022-03-05 11:37:57 -0.12 mins
## 3 2022-05-30 11:06:29 2022-05-30 11:06:17 -0.20 mins
## 4 2022-06-07 19:15:39 2022-06-07 17:05:37 -130.03 mins
## 5 2022-06-07 19:14:46 2022-06-07 17:07:45 -127.02 mins
## 6 2022-06-23 19:22:57 2022-06-23 19:21:46 -1.18 mins
These faulty rows can either be removed from the dataset or repaired by swapping the values of started_at and ended_at. For the next step, I decided to repair them.
Swapping started_at and ended_at for the faulty rows:
old_start_end <- select(cleaned_data[cleaned_data$duration < 0, ], c(started_at, ended_at))
cleaned_data[cleaned_data$duration < 0, c("ended_at", "started_at")] <- old_start_end
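A vectorized alternative (a sketch, not applied here): take the element-wise minimum of the two timestamps as the start and the maximum as the end, which repairs every reversed pair in one step without subsetting:

fixed_start <- pmin(cleaned_data$started_at, cleaned_data$ended_at) # earlier timestamp
fixed_end <- pmax(cleaned_data$started_at, cleaned_data$ended_at)   # later timestamp
# cleaned_data$started_at <- fixed_start; cleaned_data$ended_at <- fixed_end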
Checking whether the values were correctly swapped:
cleaned_data[cleaned_data$duration < 0, c("started_at", "ended_at", "duration")]
## # A tibble: 100 × 3
## started_at ended_at duration
## <dttm> <dttm> <drtn>
## 1 2022-03-05 10:55:01 2022-03-05 11:00:57 -5.93 mins
## 2 2022-03-05 11:37:57 2022-03-05 11:38:04 -0.12 mins
## 3 2022-05-30 11:06:17 2022-05-30 11:06:29 -0.20 mins
## 4 2022-06-07 17:05:37 2022-06-07 19:15:39 -130.03 mins
## 5 2022-06-07 17:07:45 2022-06-07 19:14:46 -127.02 mins
## 6 2022-06-23 19:21:46 2022-06-23 19:22:57 -1.18 mins
## 7 2022-06-07 17:05:42 2022-06-07 19:14:47 -129.08 mins
## 8 2022-06-07 16:07:28 2022-06-07 16:18:37 -11.15 mins
## 9 2022-06-07 17:05:41 2022-06-07 18:47:01 -101.33 mins
## 10 2022-06-07 17:05:24 2022-06-07 19:11:33 -126.15 mins
## # … with 90 more rows
The values were correctly swapped. (Note that the duration column above still shows the old negative values, since it has not been recalculated yet.) Now the duration has to be recalculated:
duration <- cleaned_data$ended_at - cleaned_data$started_at
duration <- round(make_difftime(duration, units="minute"), 2)
cleaned_data[, "duration"] <- duration
head(cleaned_data[, c("started_at", "ended_at", "duration")])
## # A tibble: 6 × 3
## started_at ended_at duration
## <dttm> <dttm> <drtn>
## 1 2022-01-13 11:59:47 2022-01-13 12:02:44 2.95 mins
## 2 2022-01-10 08:41:56 2022-01-10 08:46:17 4.35 mins
## 3 2022-01-25 04:53:40 2022-01-25 04:58:01 4.35 mins
## 4 2022-01-04 00:18:04 2022-01-04 00:33:00 14.93 mins
## 5 2022-01-20 01:31:10 2022-01-20 01:37:12 6.03 mins
## 6 2022-01-11 18:48:09 2022-01-11 18:51:31 3.37 mins
Checking whether there are still negative durations:
min(cleaned_data$duration); mean(cleaned_data$duration); median(cleaned_data$duration); max(cleaned_data$duration)
## Time difference of 0 mins
## Time difference of 16.33219 mins
## Time difference of 10.27 mins
## Time difference of 34354.07 mins
Checking the data:
summary(cleaned_data)
## ride_id rideable_type started_at
## Length:5661859 Length:5661859 Min. :2022-01-01 00:00:05.00
## Class :character Class :character 1st Qu.:2022-05-28 19:17:08.50
## Mode :character Mode :character Median :2022-07-22 15:11:15.00
## Mean :2022-07-20 07:27:27.55
## 3rd Qu.:2022-09-16 07:38:28.00
## Max. :2022-12-31 23:59:26.00
## ended_at start_lat start_lng
## Min. :2022-01-01 00:01:48.00 Min. :41.64 Min. :-87.84
## 1st Qu.:2022-05-28 19:37:21.00 1st Qu.:41.88 1st Qu.:-87.66
## Median :2022-07-22 15:30:24.00 Median :41.90 Median :-87.64
## Mean :2022-07-20 07:43:47.49 Mean :41.90 Mean :-87.65
## 3rd Qu.:2022-09-16 07:51:35.00 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :2023-01-01 18:09:37.00 Max. :45.64 Max. :-73.80
## end_lat end_lng member_casual duration
## Min. : 0.00 Min. :-88.14 Length:5661859 Length:5661859
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character Class :difftime
## Median :41.90 Median :-87.64 Mode :character Mode :numeric
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. : 0.00
unique(cleaned_data$rideable_type); unique(cleaned_data$user_type)
## [1] "electric_bike" "classic_bike" "docked_bike"
## NULL
The NULL output reveals that the renaming above was applied to trip_2022 rather than cleaned_data, so cleaned_data has no user_type column. Check how many stations there are:
unique(cleaned_data[, c("start_lat", "start_lng")])
## # A tibble: 1,952,378 × 2
## start_lat start_lng
## <dbl> <dbl>
## 1 42.0 -87.7
## 2 42.0 -87.7
## 3 41.9 -87.7
## 4 42.0 -87.7
## 5 41.9 -87.6
## 6 41.9 -87.7
## 7 42.0 -87.7
## 8 41.9 -87.7
## 9 41.9 -87.6
## 10 42.0 -87.7
## # … with 1,952,368 more rows
The previous result is quite surprising: there appear to be 1,952,378 distinct start coordinates, while the total number of trips that year was 5,661,859. This shows that my assumption that each station has a unique coordinate (_lat and _lng) was wrong. The coordinates apparently describe the start and end positions of each bike, not the stations.
Let's check how many stations there actually are:
length(unique(trip_2022$start_station_id)); length(unique(trip_2022$end_station_id)); length(unique(trip_2022$start_station_name)); length(unique(trip_2022$end_station_name));
## [1] 1314
## [1] 1318
## [1] 1675
## [1] 1693
length(unique(c(trip_2022$start_station_id, trip_2022$end_station_id)));
## [1] 1323
length(unique(c(trip_2022$start_station_name, trip_2022$end_station_name)))
## [1] 1710
Checking the missing values of the original data again:
colSums(is.na(na_df))[colSums(is.na(na_df)) > 0]
## start_station_name start_station_id end_station_name end_station_id
## 833064 833064 892742 892742
## end_lat end_lng
## 5858 5858
Summary of the missing values:
- station names and IDs: up to 892742 of 5667717 rows -> about 15.8%
- end coordinates (end_lat and end_lng): 5858 of 5667717 rows -> about 0.1%
Removing all rows with missing values is not an option, because dropping 15.8% of the rows is just too much. Therefore, I use different subsets of the data for different analyses: for distance-related analysis the _lat and _lng values are used, and for station-related analysis station_name is used. To be able to do that, I went back to the original data and performed the cleaning without removing any columns.
In this section, the same data transformation process from sections 2.4-3 is applied to the original data, with one difference: no columns or rows are removed. This step is necessary due to the wrong assumption mentioned in the last section.
Create the column duration_min and count the rows with duration < 0:
duration_min <- trip_2022$ended_at - trip_2022$started_at
duration_min <- round(make_difftime(duration_min, units="minute"), 2)
trip_2022_modified <- trip_2022 %>% mutate(duration_min = duration_min)
nrow(trip_2022_modified[(trip_2022_modified$duration_min) < 0, ])
## [1] 100
Repair the rows with negative duration and count the rows with duration < 0 again:
# swap the two columns ("ended_at", "started_at") for rows with negative duration:
old_start_end <- select(trip_2022_modified[trip_2022_modified$duration_min < 0, ], c(started_at, ended_at))
trip_2022_modified[trip_2022_modified$duration_min < 0, c("ended_at", "started_at")] <- old_start_end
# calculate new duration for rows with negative duration:
trip_2022_modified$duration_min <- trip_2022_modified$ended_at - trip_2022_modified$started_at
trip_2022_modified$duration_min <- as.numeric(round(make_difftime(trip_2022_modified$duration_min, units="minute"), 2))
# check how many rows have a negative duration; at this point it should be 0.
nrow(trip_2022_modified[(trip_2022_modified$duration_min) < 0, ])
## [1] 0
colnames(trip_2022_modified)
## [1] "ride_id" "bike_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "user_type" "duration_min"
In this section, the data are visualized using different parameters. I also present my findings, which start with the \(\color{blue}{\text{->}}\) symbol.
Create the column distance_m:
distance_m <- distHaversine(trip_2022_modified[,c('start_lng','start_lat')],
trip_2022_modified[,c('end_lng','end_lat')])
trip_2022_modified <- trip_2022_modified %>% mutate(distance_m=distance_m)
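distHaversine() expects lon/lat pairs and returns meters. A quick sanity check (a sketch with made-up coordinates near Chicago):

distHaversine(c(-87.64, 41.90), c(-87.64, 41.90)) # identical points -> 0 m
distHaversine(c(-87.64, 41.90), c(-87.63, 41.90)) # 0.01 deg of longitude -> roughly 830 m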
Create the column speed_kmh:
trip_2022_modified <- trip_2022_modified %>% mutate(speed_kmh=round((distance_m*60)/(1000*duration_min), 2))
summary(trip_2022_modified[, c('distance_m', 'speed_kmh')])
## distance_m speed_kmh
## Min. : 0 Min. : 0.00
## 1st Qu.: 874 1st Qu.: 7.24
## Median : 1576 Median :10.61
## Mean : 2142 Mean : Inf
## 3rd Qu.: 2783 3rd Qu.:13.80
## Max. :9825063 Max. : Inf
## NA's :5858 NA's :5962
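The Inf values in speed_kmh stem from trips with duration_min == 0 (division by zero). They are filtered out below together with other implausible speeds, but one could also guard against them directly; a sketch (not applied here; speed_no_inf is a hypothetical variable):

speed_no_inf <- with(trip_2022_modified,
                     ifelse(duration_min > 0,
                            round((distance_m * 60) / (1000 * duration_min), 2),
                            NA_real_)) # zero-duration trips become NA instead of Inf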
Create subset for distance related analysis:
subset_clean_distance <- trip_2022_modified[!(is.na(trip_2022_modified$distance_m)), ]
summary(subset_clean_distance[, c('distance_m', 'speed_kmh')])
## distance_m speed_kmh
## Min. : 0 Min. : 0.00
## 1st Qu.: 874 1st Qu.: 7.24
## Median : 1576 Median :10.61
## Mean : 2142 Mean : Inf
## 3rd Qu.: 2783 3rd Qu.:13.80
## Max. :9825063 Max. : Inf
## NA's :104
Find faulty data:
speed_sorted <- unique(arrange(subset_clean_distance, desc(speed_kmh))$speed_kmh)
head(speed_sorted, 20)
## [1] Inf 2357738.52 421024.67 134580.42 127573.59 105621.53
## [7] 76241.93 65348.55 28854.81 27694.29 3846.29 3647.58
## [13] 2489.58 2485.69 2290.83 2202.49 2169.14 1964.01
## [19] 1963.99 1947.04
As can be seen above, there are still many faulty records; it is doubtful that a bike can reach 2,357,738 km/h. Therefore, it makes sense to keep only data with a speed of at most 50 km/h.
subset_clean_distance <- subset(subset_clean_distance, speed_kmh <= 50)
# Round speed_kmh to scale down the data. Necessary step; otherwise the graph would treat 5.02 and 5.24 km/h as distinct values instead of 5 km/h for both.
subset_clean_distance[, "speed_kmh"] <- round(subset_clean_distance[, "speed_kmh"], 0)
summary(subset_clean_distance)
## ride_id bike_type started_at
## Length:5650877 Length:5650877 Min. :2022-01-01 00:00:05.00
## Class :character Class :character 1st Qu.:2022-05-28 18:55:57.00
## Mode :character Mode :character Median :2022-07-22 14:50:19.00
## Mean :2022-07-20 06:59:53.77
## 3rd Qu.:2022-09-16 07:28:17.00
## Max. :2022-12-31 23:59:26.00
## ended_at start_station_name start_station_id
## Min. :2022-01-01 00:01:48.00 Length:5650877 Length:5650877
## 1st Qu.:2022-05-28 19:15:54.00 Class :character Class :character
## Median :2022-07-22 15:08:46.00 Mode :character Mode :character
## Mean :2022-07-20 07:16:15.57
## 3rd Qu.:2022-09-16 07:43:19.00
## Max. :2023-01-01 18:09:37.00
## end_station_name end_station_id start_lat start_lng
## Length:5650877 Length:5650877 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
## end_lat end_lng user_type duration_min
## Min. :41.55 Min. :-88.14 Length:5650877 Min. : 0.02
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.: 5.83
## Median :41.90 Median :-87.64 Mode :character Median : 10.30
## Mean :41.90 Mean :-87.65 Mean : 16.36
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 18.45
## Max. :42.37 Max. :-87.30 Max. :34354.07
## distance_m speed_kmh
## Min. : 0.0 Min. : 0.00
## 1st Qu.: 876.6 1st Qu.: 7.00
## Median : 1580.0 Median :11.00
## Mean : 2130.9 Mean :10.35
## 3rd Qu.: 2787.0 3rd Qu.:14.00
## Max. :42319.5 Max. :50.00
Analyze speed_kmh:
# Count how often every combination of "speed_kmh" and "bike_type" occurs.
count_bike_type_speed <- subset_clean_distance %>%
count(bike_type, speed_kmh)
# Count how often each speed value occurs
count_speed <- subset_clean_distance %>%
count(speed_kmh)
# Merge the results of the last two steps into one table (as preparation for the next step)
count_bike_type_speed <- count_bike_type_speed %>%
merge(count_speed, by = "speed_kmh") %>%
rename(count=n.x, total=n.y)
# Calculate the distribution by dividing the count by the total from the previous steps.
count_bike_type_speed <- count_bike_type_speed %>%
mutate(rel_count=(count/total)*100)
head(count_bike_type_speed)
## speed_kmh bike_type count total rel_count
## 1 0 classic_bike 177054 415109 42.652412
## 2 0 docked_bike 39535 415109 9.524005
## 3 0 electric_bike 198520 415109 47.823584
## 4 1 classic_bike 49705 116770 42.566584
## 5 1 electric_bike 52844 116770 45.254774
## 6 1 docked_bike 14221 116770 12.178642
ggplot(count_bike_type_speed) +
geom_bar(aes(x=speed_kmh, y=rel_count, fill=bike_type), stat="identity") +
scale_y_continuous(labels=function(x) paste(x,"%", sep=" "))
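As a side note, geom_col() is an equivalent, slightly more idiomatic spelling of geom_bar(stat = "identity"):

ggplot(count_bike_type_speed) +
  geom_col(aes(x = speed_kmh, y = rel_count, fill = bike_type)) + # same bars as above
  scale_y_continuous(labels = function(x) paste(x, "%", sep = " "))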
count_bike_type_speed %>% group_by(bike_type) %>%
summarize(mean(speed_kmh), median(speed_kmh), sd(speed_kmh))
## # A tibble: 3 × 4
## bike_type `mean(speed_kmh)` `median(speed_kmh)` `sd(speed_kmh)`
## <chr> <dbl> <dbl> <dbl>
## 1 classic_bike 21.9 21 13.9
## 2 docked_bike 13.2 12.5 9.31
## 3 electric_bike 25 25 14.9
\(\color{blue}{\text{->}}\) The graph and table above show that electric bike users ride roughly 3 km/h faster than classic bike users, while docked bike users are almost half as fast as electric bike users.
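One caveat: the summary above averages over the rows of the count table, i.e. over distinct speed values, not over individual trips. A trip-weighted version (a sketch, analogous to the user_type summary below) would aggregate the raw subset instead:

subset_clean_distance %>%
  group_by(bike_type) %>%
  summarize(mean(speed_kmh), median(speed_kmh), sd(speed_kmh)) # one row per trip enters the mean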
# Count how often every combination of "speed_kmh" and "user_type" occurs.
count_user_type_speed <- subset_clean_distance %>%
count(user_type, speed_kmh)
# Merge the table with the total speed counts (as preparation for the next step)
count_user_type_speed <- count_user_type_speed %>%
merge(count_speed, by = "speed_kmh") %>%
rename(count=n.x, total=n.y)
# Calculate the distribution by dividing the count by the total from the previous steps.
count_user_type_speed <- count_user_type_speed %>%
mutate(rel_count=(count/total)*100)
ggplot(count_user_type_speed) +
geom_bar(aes(x=speed_kmh, y=rel_count, fill=user_type), stat="identity") +
scale_y_continuous(labels=function(x) paste(x,"%", sep=" "))
subset_clean_distance %>% group_by(user_type) %>%
summarize(mean(speed_kmh), median(speed_kmh), sd(speed_kmh))
## # A tibble: 2 × 4
## user_type `mean(speed_kmh)` `median(speed_kmh)` `sd(speed_kmh)`
## <chr> <dbl> <dbl> <dbl>
## 1 member 11.3 11 5.17
## 2 non-member 8.96 9 5.66
\(\color{blue}{\text{->}}\) The graph and table above show that non-members ride somewhat slower than members (a mean of about 9 km/h versus 11.3 km/h).
Analyze bike_type:
# Calculating relative distribution
rel_dist <- trip_2022_modified %>% count(bike_type, user_type)
rel_dist <- rel_dist %>% group_by(bike_type) %>% mutate(total=sum(n))
# Calculating text positions for the next visualization
rel_dist <- rel_dist %>% arrange(desc(n)) %>%
mutate(rel_freq = (n/total)*100, pos = cumsum(n) - (0.5 * n)) %>%
rename(count=n)
rel_dist
## # A tibble: 5 × 6
## # Groups: bike_type [3]
## bike_type user_type count total rel_freq pos
## <chr> <chr> <int> <int> <dbl> <dbl>
## 1 classic_bike member 1709755 2601214 65.7 854878.
## 2 electric_bike member 1635930 2889029 56.6 817965
## 3 electric_bike non-member 1253099 2889029 43.4 2262480.
## 4 classic_bike non-member 891459 2601214 34.3 2155484.
## 5 docked_bike non-member 177474 177474 100 88737
ggplot() +
geom_bar(data=rel_dist, aes(x=bike_type, y=count, fill=user_type), stat="identity") +
geom_text(data=rel_dist, aes(x=bike_type, y=pos, label=paste(round(rel_freq, 0), "%", sep=" ")),
vjust=1,color="white", position = position_dodge(width=1), size=4)
trip_2022_modified %>% group_by(bike_type) %>%
summarize(mean(duration_min), median(duration_min), sd(duration_min))
## # A tibble: 3 × 4
## bike_type `mean(duration_min)` `median(duration_min)` `sd(duration_min)`
## <chr> <dbl> <dbl> <dbl>
## 1 classic_bike 19.0 10.9 61.8
## 2 docked_bike 123. 28.0 958.
## 3 electric_bike 13.5 9.33 17.2
\(\color{blue}{\text{->}}\) The table above shows that the average usage duration for docked bikes is much longer, while interestingly the graph above shows that they are rented much less often than the other bike types.
Analyze duration:
subset_duration <- trip_2022_modified %>% select(bike_type, user_type, duration_min, distance_m)
# Round duration_min to scale down the data. Necessary step; otherwise the graph would treat 10.02 and 10.1 minutes as distinct values instead of 10 minutes for both.
subset_duration[, "duration_min"] <- round(subset_duration[, "duration_min"], 0)
summary(subset_duration)
## bike_type user_type duration_min distance_m
## Length:5667717 Length:5667717 Min. : 0.00 Min. : 0
## Class :character Class :character 1st Qu.: 6.00 1st Qu.: 874
## Mode :character Mode :character Median : 10.00 Median : 1576
## Mean : 19.45 Mean : 2142
## 3rd Qu.: 18.00 3rd Qu.: 2783
## Max. :41387.00 Max. :9825063
## NA's :5858
\(\color{blue}{\text{->}}\) Mean usage duration is almost 20 minutes.
subset_duration %>% subset(duration_min < 120) %>%
ggplot() + geom_bar(aes(x=duration_min, fill=user_type)) +
scale_y_log10()
Let’s create a function to visualize relative distribution:
vis_rel_dist <- function(data, param, fill_param) {
# function to visualize relative distribution with param as x axis and fill_param as fill.
rel_dist <- data %>% count({{ param }}, {{ fill_param }})
rel_dist <- rel_dist %>% group_by({{ param }}) %>% mutate(total=sum(n))
rel_dist <- rel_dist %>% mutate(rel_freq = (n/total)*100)
ggplot(rel_dist) +
geom_bar(aes(x={{ param }}, y=rel_freq, fill={{ fill_param }}), stat="identity")
}
vis_rel_dist(subset(subset_duration, duration_min < 120), duration_min, user_type) +
scale_y_continuous(labels=function(x) paste(x,"%", sep=" "))
Using the duration in minutes is not practical for showing the general trend. Let's transform the data to visualize the duration in hours:
subset_duration <- subset_duration %>% mutate(duration_hour = round((duration_min/60), 0))
Visualize duration_hour:
subset_duration %>% subset(duration_hour < 48) %>%
ggplot() + geom_bar(aes(x=duration_hour, fill=user_type)) +
scale_y_log10()
subset_duration %>% group_by(user_type) %>%
summarize(mean(duration_min), median(duration_min), sd(duration_min))
## # A tibble: 2 × 4
## user_type `mean(duration_min)` `median(duration_min)` `sd(duration_min)`
## <chr> <dbl> <dbl> <dbl>
## 1 member 12.7 9 29.4
## 2 non-member 29.1 13 273.
\(\color{blue}{\text{->}}\) As can be seen from the graph and table above, non-members rent bikes for longer durations than members.
Let's analyze rentals longer than 24 hours:
subset_duration_24h <- subset_duration %>% subset(duration_hour >24)
print(paste("Total number of rentals longer than 24 hours is", nrow(subset_duration_24h), "or", round(nrow(subset_duration_24h)/nrow(subset_duration) * 100, 2), "% of all orders."))
## [1] "Total number of rentals longer than 24 hours is 5311 or 0.09 % of all orders."
subset_duration_24h %>%
count(user_type) %>%
mutate(relative_frequency=round((n/sum(n))*100))
## # A tibble: 2 × 3
## user_type n relative_frequency
## <chr> <int> <dbl>
## 1 member 701 13
## 2 non-member 4610 87
\(\color{blue}{\text{->}}\) The table shows that 87% of all rentals longer than 24 hours were made by non-members. These might be a potential campaign target. Since the dataset doesn't indicate whether these customers are one-time or frequent users, it is unknown how many of the 4610 non-members rent frequently.
subset_duration_24h %>%
filter(user_type == "non-member") %>%
count(bike_type) %>%
mutate(relative_frequency=round((n/sum(n))*100))
## # A tibble: 2 × 3
## bike_type n relative_frequency
## <chr> <int> <dbl>
## 1 classic_bike 2593 56
## 2 docked_bike 2017 44
\(\color{blue}{\text{->}}\) Non-members who rent for longer than 24 hours prefer classic bikes.
colnames(subset_duration)
## [1] "bike_type" "user_type" "duration_min" "distance_m"
## [5] "duration_hour"
subset_duration %>%
filter(user_type == "non-member") %>%
group_by(bike_type) %>%
drop_na(distance_m, duration_min) %>%
summarize(mean_distance=mean(distance_m), mean_duration_minute=mean(duration_min))
## # A tibble: 3 × 3
## bike_type mean_distance mean_duration_minute
## <chr> <dbl> <dbl>
## 1 classic_bike 2087. 24.5
## 2 docked_bike 2176. 50.7
## 3 electric_bike 2253. 16.2
Analyze user_type:
# Preparing data for visualization
# pos holds the coordinate positions used in the next ggplot visualization
pos_1 <-trip_2022_modified %>%
count(user_type)%>%
mutate(rel_n = round(((n/sum(n))*100), 0))
pos_2 <- trip_2022_modified %>%
count(user_type, bike_type)
pos_2 <- pos_2 %>%
group_by(user_type) %>%
mutate(total=sum(n)) %>%
mutate(rel_freq=round((n/total)*100))
pos_2 <- pos_2 %>%
arrange(user_type, desc(bike_type)) %>%
mutate(pos=cumsum(n) - (0.5 * n))
pos_2
## # A tibble: 5 × 6
## # Groups: user_type [2]
## user_type bike_type n total rel_freq pos
## <chr> <chr> <int> <int> <dbl> <dbl>
## 1 member electric_bike 1635930 3345685 49 817965
## 2 member classic_bike 1709755 3345685 51 2490808.
## 3 non-member electric_bike 1253099 2322032 54 626550.
## 4 non-member docked_bike 177474 2322032 8 1341836
## 5 non-member classic_bike 891459 2322032 38 1876302.
# Visualize with data from pos_1 and pos_2
ggplot() +
geom_bar(data = trip_2022_modified, aes(x=user_type, fill=bike_type)) +
geom_text(data=pos_1, aes(x=user_type, y=n, label=paste(round(rel_n, 0), "%", sep=" ")),
vjust=-0.5,color="black", position = position_dodge(width=1), size=4) +
geom_text(data=pos_2, aes(x=user_type, y=pos, label=paste(round(rel_freq, 0), "%", sep=" ")),
vjust=0.5,color="white", position = position_dodge(width=1), size=4)
\(\color{blue}{\text{->}}\) About 60% of all rentals were made by members. Non-members prefer electric bikes. Furthermore, docked bikes are rented exclusively by non-members.
trip_2022_modified %>% group_by(user_type) %>%
drop_na() %>%
summarize(mean(distance_m), median(distance_m), sd(distance_m))
## # A tibble: 2 × 4
## user_type `mean(distance_m)` `median(distance_m)` `sd(distance_m)`
## <chr> <dbl> <dbl> <dbl>
## 1 member 2071. 1499. 14997.
## 2 non-member 2163. 1651. 10695.
\(\color{blue}{\text{->}}\) On average, both members and non-members ride around 2 kilometers.
trip_2022_modified %>% group_by(bike_type) %>%
drop_na() %>%
summarize(mean(distance_m), median(distance_m), sd(distance_m))
## # A tibble: 3 × 4
## bike_type `mean(distance_m)` `median(distance_m)` `sd(distance_m)`
## <chr> <dbl> <dbl> <dbl>
## 1 classic_bike 2007. 1471. 16221.
## 2 docked_bike 2176. 1601. 2288.
## 3 electric_bike 2265. 1716. 8067.
\(\color{blue}{\text{->}}\) The difference in riding distance between bike types is minimal (around 10%).
station_name <- c(trip_2022_modified$start_station_name, trip_2022_modified$end_station_name)
print(paste("There are", length(unique(station_name)), "different stations."))
## [1] "There are 1710 different stations."
station_df <- data.frame(station_name) %>%
  count(station_name) %>%
  arrange(desc(n)) %>%
  mutate(rel_frequency = round((n/sum(n))*100, 2))
head(station_df)
## station_name n rel_frequency
## 1 <NA> 1725806 15.22
## 2 Streeter Dr & Grand Ave 150619 1.33
## 3 DuSable Lake Shore Dr & North Blvd 82231 0.73
## 4 DuSable Lake Shore Dr & Monroe St 81404 0.72
## 5 Michigan Ave & Oak St 79788 0.70
## 6 Wells St & Concord Ln 74936 0.66
\(\color{blue}{\text{->}}\) There are many missing station names in the dataset. The table above also shows the most popular stations.
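Optionally, the missing names can be dropped before ranking, so the NA row does not dominate the table; a sketch:

station_df_no_na <- data.frame(station_name) %>%
  filter(!is.na(station_name)) %>%       # drop missing station names first
  count(station_name) %>%
  arrange(desc(n)) %>%
  mutate(rel_frequency = round((n/sum(n))*100, 2))
head(station_df_no_na) # Streeter Dr & Grand Ave moves to the top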
Based on the findings from section 6, I recommend the following points as the focus of the marketing campaign:
----- The End -----