Introduction

In this case study, a junior data analyst works in marketing department at a bike-share company. The director believes the company’s success depends on maximizing the number of annual membership. The assignment for the analytic team is to analyze available dataset to answer following questions:

  • How do annual members and casual riders use rental bike differently?
  • Why would casual riders buy annual memberships?
  • How can digital media used to influence casual riders to become members?

Let’s say, I got the following assignments:

  • answer the first question,
  • document all necessary analytic steps in this documentation,
  • document and visualize the key findings,
  • make three recommendations based on his findings.

In the following sections, I describe every executed step to achieve the objective of this case study.

For this analysis, the following libraries are used:

library(tidyverse)
library(lubridate) # To transform datetime
library(geosphere) # To calculate distance from longitudinal & latitude coordinates

1. Data Collection & Preparation.

For this project, I use bike-sharing dataset for year 2022. This dataset can be downloaded from here. It is made available by Lyft Bikes and Scooters under this license agreement.

After storing the data locally, he checked the credibility of the data and if the data is ROCC (reliable, original, comprehensive, current and cited).

1.1. Opening the data.

path_trip_2022_01 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202201-divvy-tripdata.csv"
path_trip_2022_02 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202202-divvy-tripdata.csv"
path_trip_2022_03 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202203-divvy-tripdata.csv"
path_trip_2022_04 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202204-divvy-tripdata.csv"
path_trip_2022_05 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202205-divvy-tripdata.csv"
path_trip_2022_06 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202206-divvy-tripdata.csv"
path_trip_2022_07 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202207-divvy-tripdata.csv"
path_trip_2022_08 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202208-divvy-tripdata.csv"
path_trip_2022_09 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202209-divvy-publictripdata.csv"
path_trip_2022_10 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202210-divvy-tripdata.csv"
path_trip_2022_11 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202211-divvy-tripdata.csv"
path_trip_2022_12 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2022/202212-divvy-tripdata.csv"

trip_2022_01 <- read_csv(path_trip_2022_01)
trip_2022_02 <- read_csv(path_trip_2022_02)
trip_2022_03 <- read_csv(path_trip_2022_03)
trip_2022_04 <- read_csv(path_trip_2022_04)
trip_2022_05 <- read_csv(path_trip_2022_05)
trip_2022_06 <- read_csv(path_trip_2022_06)
trip_2022_07 <- read_csv(path_trip_2022_07)
trip_2022_08 <- read_csv(path_trip_2022_08)
trip_2022_09 <- read_csv(path_trip_2022_09)
trip_2022_10 <- read_csv(path_trip_2022_10)
trip_2022_11 <- read_csv(path_trip_2022_11)
trip_2022_12 <- read_csv(path_trip_2022_12)

# For testing purpose:
path_trip_2014_07 <- "/Users/gunardiali/Documents/R/Bike_Share_Dataset/2014/2014_Q3Q4/Divvy_Trips_2014-Q3-07.csv"
trip_2014_07 <- read_csv(path_trip_2014_07)

1.2. Checking if colnames for every month are consistent.

This checking is necessary before combining data from each month into one single data frame.

If output is TRUE, then all month has identical colnames. If output TRUE FALSE, colnames are not identical.

For testing purpose, let’s compare wrong dataset. The output should be TRUE FALSE.

wrong_monthly_colname_list <- c(colnames(trip_2022_01), colnames(trip_2014_07))

unique(unique(wrong_monthly_colname_list) == colnames(trip_2022_01))
## [1]  TRUE FALSE

Check if monthly data for 2022 are having the same colnames:

monthly_colname_list <- c(colnames(trip_2022_01), colnames(trip_2022_02), colnames(trip_2022_03), 
                          colnames(trip_2022_04), colnames(trip_2022_05), colnames(trip_2022_06), 
                          colnames(trip_2022_07), colnames(trip_2022_08), colnames(trip_2022_09), 
                          colnames(trip_2022_10), colnames(trip_2022_11), colnames(trip_2022_12))

unique(unique(monthly_colname_list) == colnames(trip_2022_01))
## [1] TRUE

1.3. Combining monthly data into one data frame

trip_2022 <- bind_rows(trip_2022_01, trip_2022_02, trip_2022_03,
                       trip_2022_04, trip_2022_05, trip_2022_06,
                       trip_2022_07, trip_2022_08, trip_2022_09,
                       trip_2022_10, trip_2022_11, trip_2022_12)

Get dimension for 2022 data:

dim(trip_2022)
## [1] 5667717      13

Show colnames:

colnames(trip_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

1.4. Checking missing value

Total rows containing missing value:

na_df <- trip_2022[rowSums(is.na(trip_2022)) > 0,]
nrow(na_df)
## [1] 1298357

Check which columns have missing values and how much:

colSums(is.na(na_df))[colSums(is.na(na_df)) > 0]
## start_station_name   start_station_id   end_station_name     end_station_id 
##             833064             833064             892742             892742 
##            end_lat            end_lng 
##               5858               5858

After seeing the finding above, it seems that the most reliable way to identify specific station is by using start_lat and start_lng for start station (which has no missing value) and end_lat and end_lng for end station (5858 missing values), instead of using station_id. Because there are 833064 missing start_station_id and only 5858 missing end_lat and end_lng.

Checking if the missing values for columns end_station_id, end_station_name, end_lat and end_lng show up always together:

na_df[is.na(na_df$end_lat), ] %>%
  select(end_station_id, end_station_name, end_lat, end_lng) %>% 
  unique()
## # A tibble: 1 × 4
##   end_station_id end_station_name end_lat end_lng
##   <chr>          <chr>              <dbl>   <dbl>
## 1 <NA>           <NA>                  NA      NA

As can be seen, all missing values which could identify end station happen together. Hence, it is not possible to recover 5858 missing information about end station using this dataset.

Let’s go one step further if these missing value happens in specific date:

na_df[is.na(na_df$end_lat), ]$ended_at %>% 
  date() %>% 
  unique() %>% 
  length()
## [1] 363

It seems that the missing value for end station happens daily.

Calculate the percentage of rows with missing value indicating end station:

(nrow(na_df[is.na(na_df$end_lat), ])/nrow(trip_2022))*100
## [1] 0.1033573

Only a very small portion of data i.e. 0.1 % has missing information about the end station. Hence, in the next section these data were removed.

2. Data Transformation

2.1. Removing rows with missing end station

Total row after removal should be: 5667717 - 5858 = 5661859

cleaned_data <- trip_2022[!is.na(trip_2022$end_lat),]
nrow(cleaned_data)
## [1] 5661859

2.2. Removing following columns:

start_station_name, start_station_id, end_station_name, end_station_id

cleaned_data <- cleaned_data %>% 
  select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
colnames(cleaned_data)
## [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
## [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
## [9] "member_casual"

2.3. Checking if there is still missing value

colSums(is.na(cleaned_data))
##       ride_id rideable_type    started_at      ended_at     start_lat 
##             0             0             0             0             0 
##     start_lng       end_lat       end_lng member_casual 
##             0             0             0             0

2.4. Rename columns and column values

Rename following columns into: - rideable_type -> bike_type - member_casual-> user_type

Rename following values from column user_type: - casual -> non-member

Renaming column and its value is necessary for the sake of clarity.

trip_2022 <- trip_2022 %>%
  rename(bike_type = rideable_type,
        user_type = member_casual)

trip_2022 <- trip_2022 %>% 
  mutate(user_type=recode(user_type, 
                          "casual"="non-member"))

2.5. Create a column for trip duration

Calculate time difference:

duration <- cleaned_data$ended_at - cleaned_data$started_at
head(duration)
## Time differences in secs
## [1] 177 261 261 896 362 202

The duration is still in seconds. Convert duration to minutes and round it to two:

duration <- round(make_difftime(duration, units="minute"), 2)
head(duration)
## Time differences in mins
## [1]  2.95  4.35  4.35 14.93  6.03  3.37

Insert duration into cleaned_data:

cleaned_data <- mutate(cleaned_data, duration)
head(select(cleaned_data, c(started_at, ended_at, duration)))
## # A tibble: 6 × 3
##   started_at          ended_at            duration  
##   <dttm>              <dttm>              <drtn>    
## 1 2022-01-13 11:59:47 2022-01-13 12:02:44  2.95 mins
## 2 2022-01-10 08:41:56 2022-01-10 08:46:17  4.35 mins
## 3 2022-01-25 04:53:40 2022-01-25 04:58:01  4.35 mins
## 4 2022-01-04 00:18:04 2022-01-04 00:33:00 14.93 mins
## 5 2022-01-20 01:31:10 2022-01-20 01:37:12  6.03 mins
## 6 2022-01-11 18:48:09 2022-01-11 18:51:31  3.37 mins

2.6. Checking duration

min(cleaned_data$duration); mean(cleaned_data$duration); median(cleaned_data$duration); max(cleaned_data$duration)
## Time difference of -10353.35 mins
## Time difference of 16.32754 mins
## Time difference of 10.27 mins
## Time difference of 34354.07 mins

It seems that there are some faulty values, because the minimum duration is a negative value.

2.7. Checking faulty duration

Total faulty nrow for duration:

nrow(cleaned_data[cleaned_data$duration < 0, ])
## [1] 100

Display head of faulty duration:

head(select(cleaned_data[cleaned_data$duration < 0, ], c(started_at, ended_at, duration)))
## # A tibble: 6 × 3
##   started_at          ended_at            duration    
##   <dttm>              <dttm>              <drtn>      
## 1 2022-03-05 11:00:57 2022-03-05 10:55:01   -5.93 mins
## 2 2022-03-05 11:38:04 2022-03-05 11:37:57   -0.12 mins
## 3 2022-05-30 11:06:29 2022-05-30 11:06:17   -0.20 mins
## 4 2022-06-07 19:15:39 2022-06-07 17:05:37 -130.03 mins
## 5 2022-06-07 19:14:46 2022-06-07 17:07:45 -127.02 mins
## 6 2022-06-23 19:22:57 2022-06-23 19:21:46   -1.18 mins

This faulty data can be removed from the dataset or repaired by switching the value for started_at and ended_at. For the next section, I decided to repair the faulty dataset.

3. Repairing dataset

Switching values for started_atand ended_at for fault rows:

old_start_end <- select(cleaned_data[cleaned_data$duration < 0, ], c(started_at, ended_at))

cleaned_data[cleaned_data$duration < 0, c("ended_at", "started_at")] <- old_start_end

Checking if the values were correctly switched:

cleaned_data[cleaned_data$duration < 0, c("started_at", "ended_at", "duration")]
## # A tibble: 100 × 3
##    started_at          ended_at            duration    
##    <dttm>              <dttm>              <drtn>      
##  1 2022-03-05 10:55:01 2022-03-05 11:00:57   -5.93 mins
##  2 2022-03-05 11:37:57 2022-03-05 11:38:04   -0.12 mins
##  3 2022-05-30 11:06:17 2022-05-30 11:06:29   -0.20 mins
##  4 2022-06-07 17:05:37 2022-06-07 19:15:39 -130.03 mins
##  5 2022-06-07 17:07:45 2022-06-07 19:14:46 -127.02 mins
##  6 2022-06-23 19:21:46 2022-06-23 19:22:57   -1.18 mins
##  7 2022-06-07 17:05:42 2022-06-07 19:14:47 -129.08 mins
##  8 2022-06-07 16:07:28 2022-06-07 16:18:37  -11.15 mins
##  9 2022-06-07 17:05:41 2022-06-07 18:47:01 -101.33 mins
## 10 2022-06-07 17:05:24 2022-06-07 19:11:33 -126.15 mins
## # … with 90 more rows

The values were correctly switched. Now the duration should be recalculated:

duration <- cleaned_data$ended_at - cleaned_data$started_at
duration <- round(make_difftime(duration, units="minute"), 2)
cleaned_data[, "duration"] <- duration
head(cleaned_data[, c("started_at", "ended_at", "duration")])
## # A tibble: 6 × 3
##   started_at          ended_at            duration  
##   <dttm>              <dttm>              <drtn>    
## 1 2022-01-13 11:59:47 2022-01-13 12:02:44  2.95 mins
## 2 2022-01-10 08:41:56 2022-01-10 08:46:17  4.35 mins
## 3 2022-01-25 04:53:40 2022-01-25 04:58:01  4.35 mins
## 4 2022-01-04 00:18:04 2022-01-04 00:33:00 14.93 mins
## 5 2022-01-20 01:31:10 2022-01-20 01:37:12  6.03 mins
## 6 2022-01-11 18:48:09 2022-01-11 18:51:31  3.37 mins

Checking if there is still negative duration:

min(cleaned_data$duration); mean(cleaned_data$duration); median(cleaned_data$duration); max(cleaned_data$duration)
## Time difference of 0 mins
## Time difference of 16.33219 mins
## Time difference of 10.27 mins
## Time difference of 34354.07 mins

4. View the data

Checking the data:

summary(cleaned_data)
##    ride_id          rideable_type        started_at                    
##  Length:5661859     Length:5661859     Min.   :2022-01-01 00:00:05.00  
##  Class :character   Class :character   1st Qu.:2022-05-28 19:17:08.50  
##  Mode  :character   Mode  :character   Median :2022-07-22 15:11:15.00  
##                                        Mean   :2022-07-20 07:27:27.55  
##                                        3rd Qu.:2022-09-16 07:38:28.00  
##                                        Max.   :2022-12-31 23:59:26.00  
##     ended_at                        start_lat       start_lng     
##  Min.   :2022-01-01 00:01:48.00   Min.   :41.64   Min.   :-87.84  
##  1st Qu.:2022-05-28 19:37:21.00   1st Qu.:41.88   1st Qu.:-87.66  
##  Median :2022-07-22 15:30:24.00   Median :41.90   Median :-87.64  
##  Mean   :2022-07-20 07:43:47.49   Mean   :41.90   Mean   :-87.65  
##  3rd Qu.:2022-09-16 07:51:35.00   3rd Qu.:41.93   3rd Qu.:-87.63  
##  Max.   :2023-01-01 18:09:37.00   Max.   :45.64   Max.   :-73.80  
##     end_lat         end_lng       member_casual        duration       
##  Min.   : 0.00   Min.   :-88.14   Length:5661859     Length:5661859   
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   Class :difftime  
##  Median :41.90   Median :-87.64   Mode  :character   Mode  :numeric   
##  Mean   :41.90   Mean   :-87.65                                       
##  3rd Qu.:41.93   3rd Qu.:-87.63                                       
##  Max.   :42.37   Max.   :  0.00
unique(cleaned_data$rideable_type); unique(cleaned_data$user_type)
## [1] "electric_bike" "classic_bike"  "docked_bike"
## NULL

Check how many stations there are:

unique(cleaned_data[, c("start_lat", "start_lng")])
## # A tibble: 1,952,378 × 2
##    start_lat start_lng
##        <dbl>     <dbl>
##  1      42.0     -87.7
##  2      42.0     -87.7
##  3      41.9     -87.7
##  4      42.0     -87.7
##  5      41.9     -87.6
##  6      41.9     -87.7
##  7      42.0     -87.7
##  8      41.9     -87.7
##  9      41.9     -87.6
## 10      42.0     -87.7
## # … with 1,952,368 more rows

The previous result is quite shocking, because it seems that there were 1.952.378 different start-stations while the total trips in that year were 5.661.859. It shows that I have used a wrong assumption that each station has a unique coordinate (_latand _lng). It seems that the coordinate are for start- and end-bike-position and not station’s coordinate.

Let’s check how many stations are there:

length(unique(trip_2022$start_station_id)); length(unique(trip_2022$end_station_id)); length(unique(trip_2022$start_station_name)); length(unique(trip_2022$end_station_name));
## [1] 1314
## [1] 1318
## [1] 1675
## [1] 1693
length(unique(c(trip_2022$start_station_id, trip_2022$end_station_id)));
## [1] 1323
length(unique(c(trip_2022$start_station_name, trip_2022$end_station_name)))
## [1] 1710

Checking back missing value of original data:

colSums(is.na(na_df))[colSums(is.na(na_df)) > 0]
## start_station_name   start_station_id   end_station_name     end_station_id 
##             833064             833064             892742             892742 
##            end_lat            end_lng 
##               5858               5858

Summary for missing values:

  • data with missing value about start-station -> 833064/5667717 * 100%: 14.7%
  • data with missing value about end-station -> 892742/5667717 * 100%: 15.8%
  • data with missing end position (end_lat and end_lng) -> 5858/5667717 * 100%: 0.1%

It seems removing rows with missing value is not an option, because removing 15.8% rows is just too much. Therefore, I use different subset of data for different analysis, i.e.:

  • for distance-related analysis, the subset data without missing _lat and _lng values is used,
  • for station-related analysis, the subset data without missing station_name is used.

To be able to perform that, I went back to original data and perform cleaning without removing any columns.

5. Go back to square one. Clean original data

In this section, the same data transformation process from section 2.4-3 were performed to original data with difference: no columns and rows are removed. This step is necessary due to the wrong assumption mentioned in last section.

Create column duration and get nrow for duration < 0:

duration_min = trip_2022$ended_at - trip_2022$started_at
duration_min <- round(make_difftime(duration_min, units="minute"), 2)
trip_2022_modified <- trip_2022 %>% mutate(duration_min = duration_min)
nrow(trip_2022_modified[(trip_2022_modified$duration_min) < 0, ])
## [1] 100

Repair row with negative duration and get nrow for duration < 0:

# switching two columns ("ended_at", "started_at") for rows with negative duration:
old_start_end <- select(trip_2022_modified[trip_2022_modified$duration_min < 0, ], c(started_at, ended_at))
trip_2022_modified[trip_2022_modified$duration_min < 0, c("ended_at", "started_at")] <- old_start_end
# calculate new duration for rows with negative duration:
trip_2022_modified$duration_min <- trip_2022_modified$ended_at - trip_2022_modified$started_at
trip_2022_modified$duration_min <- as.numeric(round(make_difftime(trip_2022_modified$duration_min, units="minute"), 2))
# check how many rows has negative duration. At this point, it should be 0.
nrow(trip_2022_modified[(trip_2022_modified$duration_min) < 0, ])
## [1] 0
colnames(trip_2022_modified)
##  [1] "ride_id"            "bike_type"          "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "user_type"          "duration_min"

6. Visualizing data

In this section, the data are visualized using different parameters. I also presented my findings which start with \(\color{blue}{\text{->}}\) symbol.

6.1. Calculate trip distance

distance_m = distHaversine(trip_2022_modified[,c('start_lng','start_lat')], 
                 trip_2022_modified[,c('end_lng','end_lat')])

trip_2022_modified <- trip_2022_modified %>% mutate(distance_m=distance_m)

6.2. Calculate speed kmh

trip_2022_modified <- trip_2022_modified %>% mutate(speed_kmh=round((distance_m*60)/(1000*duration_min), 2))
summary(trip_2022_modified[, c('distance_m', 'speed_kmh')])
##    distance_m        speed_kmh    
##  Min.   :      0   Min.   : 0.00  
##  1st Qu.:    874   1st Qu.: 7.24  
##  Median :   1576   Median :10.61  
##  Mean   :   2142   Mean   :  Inf  
##  3rd Qu.:   2783   3rd Qu.:13.80  
##  Max.   :9825063   Max.   :  Inf  
##  NA's   :5858      NA's   :5962

Create subset for distance related analysis:

subset_clean_distance <- trip_2022_modified[!(is.na(trip_2022_modified$distance_m)), ]

summary(subset_clean_distance[, c('distance_m', 'speed_kmh')])
##    distance_m        speed_kmh    
##  Min.   :      0   Min.   : 0.00  
##  1st Qu.:    874   1st Qu.: 7.24  
##  Median :   1576   Median :10.61  
##  Mean   :   2142   Mean   :  Inf  
##  3rd Qu.:   2783   3rd Qu.:13.80  
##  Max.   :9825063   Max.   :  Inf  
##                    NA's   :104

Find faulty data:

speed_sorted <- unique(arrange(subset_clean_distance, desc(speed_kmh))$speed_kmh)
head(speed_sorted, 20)
##  [1]        Inf 2357738.52  421024.67  134580.42  127573.59  105621.53
##  [7]   76241.93   65348.55   28854.81   27694.29    3846.29    3647.58
## [13]    2489.58    2485.69    2290.83    2202.49    2169.14    1964.01
## [19]    1963.99    1947.04

As can be seen above, there are still many faulty data. It is doubtful, that a bike can reach 2357738 kmh. Therefore, it makes sense to only use data with speed < 50 kmh.

subset_clean_distance <- subset(subset_clean_distance, speed_kmh <= 50) 

# Round the speed_kmh to scale down the data. Necessary step, otherwise the graph would differentiate data for 5.02 and 5.24 kmh, instead of 5 kmh for both values.
subset_clean_distance[, "speed_kmh"] <- round(subset_clean_distance[, "speed_kmh"], 0)

summary(subset_clean_distance)
##    ride_id           bike_type           started_at                    
##  Length:5650877     Length:5650877     Min.   :2022-01-01 00:00:05.00  
##  Class :character   Class :character   1st Qu.:2022-05-28 18:55:57.00  
##  Mode  :character   Mode  :character   Median :2022-07-22 14:50:19.00  
##                                        Mean   :2022-07-20 06:59:53.77  
##                                        3rd Qu.:2022-09-16 07:28:17.00  
##                                        Max.   :2022-12-31 23:59:26.00  
##     ended_at                      start_station_name start_station_id  
##  Min.   :2022-01-01 00:01:48.00   Length:5650877     Length:5650877    
##  1st Qu.:2022-05-28 19:15:54.00   Class :character   Class :character  
##  Median :2022-07-22 15:08:46.00   Mode  :character   Mode  :character  
##  Mean   :2022-07-20 07:16:15.57                                        
##  3rd Qu.:2022-09-16 07:43:19.00                                        
##  Max.   :2023-01-01 18:09:37.00                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5650877     Length:5650877     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##     end_lat         end_lng        user_type          duration_min     
##  Min.   :41.55   Min.   :-88.14   Length:5650877     Min.   :    0.02  
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:    5.83  
##  Median :41.90   Median :-87.64   Mode  :character   Median :   10.30  
##  Mean   :41.90   Mean   :-87.65                      Mean   :   16.36  
##  3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.:   18.45  
##  Max.   :42.37   Max.   :-87.30                      Max.   :34354.07  
##    distance_m        speed_kmh    
##  Min.   :    0.0   Min.   : 0.00  
##  1st Qu.:  876.6   1st Qu.: 7.00  
##  Median : 1580.0   Median :11.00  
##  Mean   : 2130.9   Mean   :10.35  
##  3rd Qu.: 2787.0   3rd Qu.:14.00  
##  Max.   :42319.5   Max.   :50.00

6.3. Visualize speed_kmh

# Count how often is every combination of "speed_kmh" and "bike_type" occurs.
count_bike_type_speed <- subset_clean_distance %>% 
  count(bike_type, speed_kmh)

# Count how often does each speed value occur
count_speed <- subset_clean_distance %>% 
  count(speed_kmh)

# Merge the values from both last steps into a table (as preparation for next step)
count_bike_type_speed <- count_bike_type_speed %>% 
  merge(count_speed, by = "speed_kmh") %>% 
  rename(count=n.x, total=n.y)

# Calculate the distribution by dividing first value with second value from previous steps.
count_bike_type_speed <- count_bike_type_speed %>% 
  mutate(rel_count=(count/total)*100)

head(count_bike_type_speed)
##   speed_kmh     bike_type  count  total rel_count
## 1         0  classic_bike 177054 415109 42.652412
## 2         0   docked_bike  39535 415109  9.524005
## 3         0 electric_bike 198520 415109 47.823584
## 4         1  classic_bike  49705 116770 42.566584
## 5         1 electric_bike  52844 116770 45.254774
## 6         1   docked_bike  14221 116770 12.178642
ggplot(count_bike_type_speed) +
  geom_bar(aes(x=speed_kmh, y=rel_count, fill=bike_type), stat="identity") +
  scale_y_continuous(labels=function(x) paste(x,"%", sep=" "))

count_bike_type_speed %>% group_by(bike_type) %>% 
  summarize(mean(speed_kmh), median(speed_kmh), sd(speed_kmh))
## # A tibble: 3 × 4
##   bike_type     `mean(speed_kmh)` `median(speed_kmh)` `sd(speed_kmh)`
##   <chr>                     <dbl>               <dbl>           <dbl>
## 1 classic_bike               21.9                21             13.9 
## 2 docked_bike                13.2                12.5            9.31
## 3 electric_bike              25                  25             14.9

\(\color{blue}{\text{->}}\) The graph and table above show that electric bike user bikes 4 kmh faster than classic bike, while docked bike user is almost half as fast as electric.

# Count how often is every combination of "speed_kmh" and "user_type" occurs.
count_user_type_speed <- subset_clean_distance %>% 
  count(user_type, speed_kmh)

# Merge the table with total speed count (as preparation for next step)
count_user_type_speed <- count_user_type_speed %>% 
  merge(count_speed, by = "speed_kmh") %>% 
  rename(count=n.x, total=n.y)

# Calculate the distribution by dividing first value with second value from previous steps.
count_user_type_speed <- count_user_type_speed %>% 
  mutate(rel_count=(count/total)*100)

ggplot(count_user_type_speed) +
  geom_bar(aes(x=speed_kmh, y=rel_count, fill=user_type), stat="identity") +
  scale_y_continuous(labels=function(x) paste(x,"%", sep=" "))

subset_clean_distance %>% group_by(user_type) %>% 
  summarize(mean(speed_kmh), median(speed_kmh), sd(speed_kmh))
## # A tibble: 2 × 4
##   user_type  `mean(speed_kmh)` `median(speed_kmh)` `sd(speed_kmh)`
##   <chr>                  <dbl>               <dbl>           <dbl>
## 1 member                 11.3                   11            5.17
## 2 non-member              8.96                   9            5.66

\(\color{blue}{\text{->}}\) The graphic and table above show that non-member biker rides a bit slower than member user.

6.4. Visualize bike_type

# Calculating relative distribution
rel_dist <- trip_2022_modified %>% count(bike_type, user_type)
rel_dist <- rel_dist %>% group_by(bike_type) %>% mutate(total=sum(n))

# Calculating text position for next visualization
rel_dist <- rel_dist %>% arrange(desc(n)) %>% 
  mutate(rel_freq = (n/total)*100, pos = cumsum(n) - (0.5 * n)) %>% 
  rename(count=n)
rel_dist
## # A tibble: 5 × 6
## # Groups:   bike_type [3]
##   bike_type     user_type    count   total rel_freq      pos
##   <chr>         <chr>        <int>   <int>    <dbl>    <dbl>
## 1 classic_bike  member     1709755 2601214     65.7  854878.
## 2 electric_bike member     1635930 2889029     56.6  817965 
## 3 electric_bike non-member 1253099 2889029     43.4 2262480.
## 4 classic_bike  non-member  891459 2601214     34.3 2155484.
## 5 docked_bike   non-member  177474  177474    100     88737
ggplot() + 
  geom_bar(data=rel_dist, aes(x=bike_type, y=count, fill=user_type), stat="identity") +
  geom_text(data=rel_dist, aes(x=bike_type, y=pos, label=paste(round(rel_freq, 0), "%", sep=" ")), 
            vjust=1,color="white", position = position_dodge(width=1), size=4)

trip_2022_modified %>% group_by(bike_type) %>% 
  summarize(mean(duration_min), median(duration_min), sd(duration_min))
## # A tibble: 3 × 4
##   bike_type     `mean(duration_min)` `median(duration_min)` `sd(duration_min)`
##   <chr>                        <dbl>                  <dbl>              <dbl>
## 1 classic_bike                  19.0                  10.9                61.8
## 2 docked_bike                  123.                   28.0               958. 
## 3 electric_bike                 13.5                   9.33               17.2

\(\color{blue}{\text{->}}\) The table above shows that average usage duration for docked bike is much longer and interestingly the graph above shows it is being rented much less than other bike types.

6.5. Visualize duration

subset_duration <- trip_2022_modified %>% select(bike_type, user_type, duration_min, distance_m)

# Round the duration_min to scale down the data. Necessary step, otherwise the graph would differentiate data for 10.02 and 10.1 minutes, instead of 10 minutes for both values.
subset_duration[, "duration_min"] <- round(subset_duration[, "duration_min"], 0)

summary(subset_duration)
##   bike_type          user_type          duration_min        distance_m     
##  Length:5667717     Length:5667717     Min.   :    0.00   Min.   :      0  
##  Class :character   Class :character   1st Qu.:    6.00   1st Qu.:    874  
##  Mode  :character   Mode  :character   Median :   10.00   Median :   1576  
##                                        Mean   :   19.45   Mean   :   2142  
##                                        3rd Qu.:   18.00   3rd Qu.:   2783  
##                                        Max.   :41387.00   Max.   :9825063  
##                                                           NA's   :5858

\(\color{blue}{\text{->}}\) Mean usage duration is almost 20 minutes.

subset_duration %>% subset(duration_min < 120) %>% 
  ggplot() + geom_bar(aes(x=duration_min, fill=user_type)) +
  scale_y_log10()

Let’s create a function to visualize relative distribution:

vis_rel_dist <- function(data, param, fill_param) {
  # function to visualize relative distribution with param as x axis and fill_param as fill.
  rel_dist <- data %>% count({{ param }}, {{ fill_param }})
  rel_dist <- rel_dist %>% group_by({{ param }}) %>% mutate(total=sum(n))
  rel_dist <- rel_dist %>% mutate(rel_freq = (n/total)*100)
  
  ggplot(rel_dist) +
    geom_bar(aes(x={{ param }}, y=rel_freq, fill={{ fill_param }}), stat="identity")
}
vis_rel_dist(subset(subset_duration, duration_min < 120), duration_min, user_type) +
  scale_y_continuous(labels=function(x) paste(x,"%", sep=" "))

It seems using duration in minute is not practical to show general daily trend. Let’s transform the data to visualize duration in hour:

subset_duration <- subset_duration %>% mutate(duration_hour = round((duration_min/60), 0))

Visualize duration_hour:

subset_duration %>% subset(duration_hour < 48) %>% 
  ggplot() + geom_bar(aes(x=duration_hour, fill=user_type)) +
  scale_y_log10()

subset_duration %>% group_by(user_type) %>% 
  summarize(mean(duration_min), median(duration_min), sd(duration_min))
## # A tibble: 2 × 4
##   user_type  `mean(duration_min)` `median(duration_min)` `sd(duration_min)`
##   <chr>                     <dbl>                  <dbl>              <dbl>
## 1 member                     12.7                      9               29.4
## 2 non-member                 29.1                     13              273.

\(\color{blue}{\text{->}}\) As can be seen from the graph and table above, non-member user rents bike with longer duration than user with membership.

Let’s analyze bike rent which > 24 hours:

subset_duration_24h <- subset_duration %>% subset(duration_hour >24)
print(paste("Total amount of rent duration 24 hours is", nrow(subset_duration_24h), "or", round(nrow(subset_duration_24h)/nrow(subset_duration) * 100, 2), "% of all order."))
## [1] "Total amount of rent duration 24 hours is 5311 or 0.09 % of all order."
subset_duration_24h %>% 
  count(user_type) %>% 
  mutate(relative_frequency=round((n/sum(n))*100))
## # A tibble: 2 × 3
##   user_type      n relative_frequency
##   <chr>      <int>              <dbl>
## 1 member       701                 13
## 2 non-member  4610                 87

\(\color{blue}{\text{->}}\) The table shows that 87% for all rent which longer than 24 hours are done by non-member user. This might be our potential campaign target. Since the dataset doesn’t give information, if these customers are one-time or frequent users, it is unknown, how much of 4610 non-member user rent frequently.

subset_duration_24h %>% 
  filter(user_type == "non-member") %>% 
  count(bike_type) %>% 
  mutate(relative_frequency=round((n/sum(n))*100))
## # A tibble: 2 × 3
##   bike_type        n relative_frequency
##   <chr>        <int>              <dbl>
## 1 classic_bike  2593                 56
## 2 docked_bike   2017                 44

\(\color{blue}{\text{->}}\) Non-member biker who rents longer than 24 hours prefers classic bike.

colnames(subset_duration)
## [1] "bike_type"     "user_type"     "duration_min"  "distance_m"   
## [5] "duration_hour"
subset_duration %>% 
  filter(user_type == "non-member") %>% 
  group_by(bike_type) %>% 
  drop_na(distance_m, duration_min) %>% 
  summarize(mean_distance=mean(distance_m), mean_duration_minute=mean(duration_min))
## # A tibble: 3 × 3
##   bike_type     mean_distance mean_duration_minute
##   <chr>                 <dbl>                <dbl>
## 1 classic_bike          2087.                 24.5
## 2 docked_bike           2176.                 50.7
## 3 electric_bike         2253.                 16.2

6.6. Visualizing user_type

# Preparing data for visualization
# pos value is for coordinate position used in the next ggplot visualization
pos_1 <-trip_2022_modified %>% 
  count(user_type)%>% 
  mutate(rel_n = round(((n/sum(n))*100), 0))

pos_2 <- trip_2022_modified %>% 
  count(user_type, bike_type)
pos_2 <- pos_2 %>% 
  group_by(user_type) %>% 
  mutate(total=sum(n)) %>% 
  mutate(rel_freq=round((n/total)*100))
pos_2 <- pos_2 %>% 
  arrange(user_type, desc(bike_type)) %>% 
  mutate(pos=cumsum(n) - (0.5 * n))

pos_2
## # A tibble: 5 × 6
## # Groups:   user_type [2]
##   user_type  bike_type           n   total rel_freq      pos
##   <chr>      <chr>           <int>   <int>    <dbl>    <dbl>
## 1 member     electric_bike 1635930 3345685       49  817965 
## 2 member     classic_bike  1709755 3345685       51 2490808.
## 3 non-member electric_bike 1253099 2322032       54  626550.
## 4 non-member docked_bike    177474 2322032        8 1341836 
## 5 non-member classic_bike   891459 2322032       38 1876302.
# Visualize with data from pos_1 and pos_2
ggplot() + 
  geom_bar(data = trip_2022_modified, aes(x=user_type, fill=bike_type)) +
  geom_text(data=pos_1, aes(x=user_type, y=n, label=paste(round(rel_n, 0), "%", sep=" ")),
            vjust=-0.5,color="black", position = position_dodge(width=1), size=4) +
  geom_text(data=pos_2, aes(x=user_type, y=pos, label=paste(round(rel_freq, 0), "%", sep=" ")),
          vjust=0.5,color="white", position = position_dodge(width=1), size=4)

\(\color{blue}{\text{->}}\) About 60% of rent was generated by member user. Non-member user prefers renting electric bike. Furthermore, all docked bike is only rented by non-member biker.

6.7. Analyzing distance

trip_2022_modified %>% group_by(user_type) %>% 
  drop_na() %>% 
  summarize(mean(distance_m), median(distance_m), sd(distance_m))
## # A tibble: 2 × 4
##   user_type  `mean(distance_m)` `median(distance_m)` `sd(distance_m)`
##   <chr>                   <dbl>                <dbl>            <dbl>
## 1 member                  2071.                1499.           14997.
## 2 non-member              2163.                1651.           10695.

\(\color{blue}{\text{->}}\) Non-member and member users bike in average around 2 kilometers.

trip_2022_modified %>% group_by(bike_type) %>% 
  drop_na() %>% 
  summarize(mean(distance_m), median(distance_m), sd(distance_m))
## # A tibble: 3 × 4
##   bike_type     `mean(distance_m)` `median(distance_m)` `sd(distance_m)`
##   <chr>                      <dbl>                <dbl>            <dbl>
## 1 classic_bike               2007.                1471.           16221.
## 2 docked_bike                2176.                1601.            2288.
## 3 electric_bike              2265.                1716.            8067.

\(\color{blue}{\text{->}}\) The difference in biking distance for different bike types is very minimal (around 10%).

6.8. Analyzing stations

station_name <- unlist(c(list(trip_2022_modified$start_station_name), list(trip_2022_modified$end_station_name)), recursive = FALSE)
print(paste("There are", length(unique(station_name)), "different stations."))
## [1] "There are 1710 different stations."
station_df <- data.frame(station_name) %>% count(station_name) %>% arrange(desc(n)) %>% mutate(rel_frequency = round((n/sum(n))*100, 2))
head(station_df)
##                         station_name       n rel_frequency
## 1                               <NA> 1725806         15.22
## 2            Streeter Dr & Grand Ave  150619          1.33
## 3 DuSable Lake Shore Dr & North Blvd   82231          0.73
## 4  DuSable Lake Shore Dr & Monroe St   81404          0.72
## 5              Michigan Ave & Oak St   79788          0.70
## 6              Wells St & Concord Ln   74936          0.66

\(\color{blue}{\text{->}}\) It seems that there is many missing station name in dataset. The table above also shows favourite station names.

7. Recommendations

Based on findings from section 6, I recommend the following points as focus for marketing campaign:

  • 41% biker doesn’t have membership and 54% of non-member biker prefers renting electric bike and 38% classic bike. This group can be targeted with an (3 months) introduction program for electric bike.
  • 87% biker who rents longer than 24 hours is not a member and they prefer classic bike (56%) and docked bike (44%). Targeting non-member biker with classic bike program for multiple days is worth considering only after experimenting with the first point, because only 0.09% rent has duration > 24 hours.
  • Renting docked bike is suitable for short distance biking and people not having a bike. Interestingly the average distance for docked bike is 2176 meter, which is 169 meter longer than classic bike (89 meter shorter than electric bike). The average usage time for docked bike is 2X longer than classic- and 3X electric bike. Developing a more attractive offer for docked bike might be a good starting point to attract new customer.

—– The End —-