Spotify Wrapped: R & ggplot2 Edition
About
While Spotify presents us with Spotify Wrapped at the end of each year, which summarizes our top songs, artists, and minutes spent listening, I was curious about how my Spotify listening patterns have evolved over a much longer time frame. Furthermore, the data analysis presented through Spotify Wrapped is fairly shallow. Spotify will provide your personal Spotify data on request, free to use and analyze, which is what I did.
I was mainly curious about things like the time of day I listened to songs, top artists over time, top songs over time, and the number of songs I listened to each day. However, there are some other variables I wanted to explore, such as how tracks end (e.g., stop button, next song) as well as how far I get through a song.
Download the data and follow along: Download
Load Packages
require(pacman)
p_load(tidyverse, jsonlite, knitr, gghighlight, plotly, hms, lubridate, here, gt, webshot, janitor, spotidy)
Exporting your Spotify Data
To export your Spotify data, you will need to do so from the Spotify app. Once your data is requested, you will receive an email with instructions on how to download it. Depending on the length of your Spotify history, there will be one or multiple data files. In my case, there were seven files (endsong_0.json through endsong_6.json), which I loaded into R using the jsonlite package and then combined using rbind.
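If the number of files in your export differs from mine, listing them programmatically avoids editing the code by hand. The following is a minimal sketch, not the code used in this post: `read_endsong()` is a hypothetical helper name, and it assumes every file follows the `endsong_<number>.json` naming pattern and shares the same columns.

```r
library(jsonlite)

# Hypothetical helper: read every endsong_<n>.json file in `dir` and stack the rows.
# Assumes all files have identical columns, as rbind() requires.
read_endsong <- function(dir) {
  files <- sort(list.files(dir, pattern = "^endsong_[0-9]+\\.json$", full.names = TRUE))
  pieces <- lapply(files, fromJSON, flatten = TRUE)
  do.call(rbind, pieces)
}

# Example call (path taken from this post's folder layout):
# streamingData <- read_endsong(here::here("Zip/MyData"))
```

The result should match what a manual `rbind()` of the individual files produces, apart from row order when there are ten or more files (the file names are sorted as text).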
Reading your Spotify Data into R
To read your Spotify data into R, we are going to use the jsonlite package with the flatten = TRUE parameter, which automatically flattens any nested fields in the file into regular columns.
StreamHistory0 <- fromJSON(here("Zip/MyData/endsong_0.json"), flatten = TRUE)
StreamHistory1 <- fromJSON(here("Zip/MyData/endsong_1.json"), flatten = TRUE)
StreamHistory2 <- fromJSON(here("Zip/MyData/endsong_2.json"), flatten = TRUE)
StreamHistory3 <- fromJSON(here("Zip/MyData/endsong_3.json"), flatten = TRUE)
StreamHistory4 <- fromJSON(here("Zip/MyData/endsong_4.json"), flatten = TRUE)
StreamHistory5 <- fromJSON(here("Zip/MyData/endsong_5.json"), flatten = TRUE)
StreamHistory6 <- fromJSON(here("Zip/MyData/endsong_6.json"), flatten = TRUE)
streamingData <- rbind(StreamHistory0, StreamHistory1, StreamHistory2, StreamHistory3, StreamHistory4, StreamHistory5, StreamHistory6)
spotify <- streamingData %>%
as_tibble() %>%
separate(col = "ts",
into = c("date","time"),
sep = "T") %>%
separate(col = "time",
into = "time",
sep = "Z")
datetime <- as.POSIXct(paste(spotify$date, spotify$time), format = "%Y-%m-%d %H:%M:%S")
spotify$datetime <- datetime
spotify <- spotify %>%
mutate(datetime = datetime - hours(6)) %>% # Convert time zones to CST
mutate(date = floor_date(datetime, "day") %>% # Creating date
as_date, minutes = ms_played / 60000) %>% # Convert ms played to minutes played
mutate(time = floor_date(datetime, "minutes")) # Remove seconds from time
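Subtracting a fixed six hours treats the whole year as standard time, so plays during daylight saving time end up one hour off. A possible alternative, sketched below on two made-up timestamps, is to parse the `ts` strings as UTC and convert them with a named time zone. The zone name `"America/Winnipeg"` is my assumption for a Central Time location; substitute your own.

```r
library(lubridate)

# Two hypothetical timestamps in the export's format: one in winter, one in summer
ts <- c("2021-01-15T18:30:00Z", "2021-07-15T18:30:00Z")

# Parse as UTC, then express the same instants in local time
utc <- ymd_hms(ts, tz = "UTC")
local <- with_tz(utc, "America/Winnipeg")  # assumed zone, not from the original code

# Local hour of each play: 12 in January (UTC-6), 13 in July (UTC-5)
hour(local)
```

`with_tz()` changes only how the instant is displayed, so durations and ordering are unaffected.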
The following columns will now be present in your data frame.
spotify %>%
glimpse(24)
Rows: 99,835
Columns: 24
$ date <date> …
$ time <dttm> …
$ username <chr> …
$ platform <chr> …
$ ms_played <int> …
$ conn_country <chr> …
$ ip_addr_decrypted <chr> …
$ user_agent_decrypted <chr> …
$ master_metadata_track_name <chr> …
$ master_metadata_album_artist_name <chr> …
$ master_metadata_album_album_name <chr> …
$ spotify_track_uri <chr> …
$ episode_name <chr> …
$ episode_show_name <chr> …
$ spotify_episode_uri <chr> …
$ reason_start <chr> …
$ reason_end <chr> …
$ shuffle <lgl> …
$ skipped <lgl> …
$ offline <lgl> …
$ offline_timestamp <dbl> …
$ incognito_mode <lgl> …
$ datetime <dttm> …
$ minutes <dbl> …
Listening Patterns Over Time
I am filtering out instances where a song was played for less than 1,000 milliseconds (1 second), as these songs were not actually listened to.
Daily Songs
spotify %>%
filter(ms_played >= 1000) %>%
group_by(date) %>%
# group_by(date = floor_date(date, "day")) %>% # This does not work after updates.
summarize(songs = n()) %>%
arrange(date) %>%
ggplot(aes(x = date, y = songs)) +
geom_col(aes(fill = songs, colour = songs)) + # Use `colour = ` here because using `fill = ` does not work with small lines.
scale_x_date(breaks = "1 year",
date_labels = "%Y",
expand = c(0, 0)) + # Removes white space around the plot on the x axis
scale_fill_gradient(high = "#0b3e34", low = "#1db954") +
scale_colour_gradient(high = "#0b3e34", low = "#1db954") +
labs(x = "Date",
y = "Number of Songs",
colour = "Songs") +
guides(fill = "none") +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
Monthly Songs
spotify %>%
filter(ms_played >= 1000) %>%
group_by(date = floor_date(date, "month")) %>%
summarize(songs = n()) %>%
arrange(date) %>%
ggplot(aes(x = date, y = songs)) +
geom_col(aes(fill = songs, colour = songs)) + # Use `colour = ` here because using `fill = ` does not work with small lines.
scale_x_date(breaks = "1 year",
date_labels = "%Y",
expand = c(0, 0)) + # Removes white space around the plot on the x axis
scale_fill_gradient(high = "#0b3e34", low = "#1db954") +
scale_colour_gradient(high = "#0b3e34", low = "#1db954") +
labs(x = "Date",
y = "Number of Songs",
colour = "Songs") +
guides(fill = "none") +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
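The only real difference between the daily and monthly charts is the `floor_date()` call, which rounds every date down to the first day of its month before counting. A minimal sketch on three made-up dates (the values and object names are illustrative, not taken from the export):

```r
library(lubridate)

# Three hypothetical listening dates: two in January, one in February
dates <- as_date(c("2020-01-03", "2020-01-28", "2020-02-10"))

# floor_date() rounds each date down to the start of its month
months <- floor_date(dates, "month")

# Counting the rounded dates gives the number of plays per month
plays_per_month <- table(months)
plays_per_month
```

Both January dates collapse onto 2020-01-01, so that month is counted twice and February once.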
Listening Calendar
This figure shows the total number of minutes I spent listening to Spotify during one-hour blocks throughout the week from 2014 to 2022. Interestingly, I listen to the most music from 3 PM to about 1 AM. This makes sense because I often go to sleep at about 1 AM and I usually work until 3-5 PM.
spotify %>%
group_by(date, hour = hour(datetime), weekday = wday(date, label = TRUE)) %>%
summarize(minuteslistened = sum(minutes)) %>%
mutate(year = format(date, "%Y")) %>%
group_by(hour, weekday) %>%
summarize(minuteslistened = sum(minuteslistened)) %>%
drop_na() %>% # NA Weekday value of 181 dropped.
ggplot(aes(weekday, hour, fill = minuteslistened)) +
geom_tile(colour = "white", size = 0.1) +
scale_fill_gradient(high = "#0b3e34", low = "#1db954") +
scale_y_continuous(trans = "reverse") +
theme_bw() +
labs(x = "Weekday",
y = "Hour of the Day (24-Hour, CST)",
fill = "Minutes",
title = "Spotify Weekly Listening Heatmap",
subtitle = "2014 to 2022") +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank())
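The heatmap is built from one number per weekday-hour cell: the sum of minutes played in that cell. The sketch below shows that aggregation in a single step on three made-up plays. The toy data and the `heat` object are illustrative only, and grouping once by weekday and hour is my simplification of the two-step summary above.

```r
library(dplyr)
library(lubridate)

# Three hypothetical plays: two on a Monday afternoon, one on a Tuesday night
toy <- tibble(
  datetime = ymd_hms(c("2022-03-07 15:10:00", "2022-03-07 15:40:00", "2022-03-08 23:05:00"), tz = "UTC"),
  minutes = c(3, 4, 5)
)

# One row per weekday-hour cell, holding the total minutes played in that cell
heat <- toy %>%
  group_by(weekday = wday(datetime, label = TRUE), hour = hour(datetime)) %>%
  summarize(minuteslistened = sum(minutes), .groups = "drop")

heat
```

The two Monday plays fall in the same 15:00 cell and are summed to 7 minutes, while the Tuesday play sits alone in the 23:00 cell.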
Artists
Top 20 Artists - All Time
spotify %>%
mutate(year = floor_date(date, "year")) %>%
mutate(year = format(year, "%Y")) %>%
group_by(master_metadata_album_artist_name) %>%
summarize(hours = sum(minutes) / 60) %>%
drop_na() %>%
arrange(desc(hours)) %>%
head(20) %>%
gt() %>%
tab_header(title = "Top 20 Artists All Time") %>%
cols_label(
master_metadata_album_artist_name = "Artist",
hours = "Hours Listened") %>%
fmt_number(columns = hours,
rows = everything(),
decimals = 0) %>%
as_raw_html()
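The ranking step boils down to summing minutes per artist, converting to hours, and keeping the largest totals. A small sketch on made-up data, using `slice_max()` as one possible alternative to sorting and taking the head (the artist names "A", "B", and "C" and the `toy` object are placeholders):

```r
library(dplyr)

# Hypothetical plays; the NA row stands in for podcast episodes with no artist
toy <- tibble(
  artist = c("A", "B", "A", "C", NA),
  minutes = c(120, 30, 60, 90, 15)
)

# Total hours per artist, keeping the two largest totals
top <- toy %>%
  filter(!is.na(artist)) %>%
  group_by(artist) %>%
  summarize(hours = sum(minutes) / 60) %>%
  slice_max(hours, n = 2)

top
```

Here artist A totals 3 hours and C totals 1.5 hours, so those two rows are returned in that order.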
spotify %>%
group_by(master_metadata_album_artist_name, date = floor_date(date, "month")) %>%
filter(master_metadata_album_artist_name %in% c("Armin van Buuren",
"GAIA",
"Above & Beyond",
"Gareth Emery",
"Alex M.O.R.P.H.",
"Gustav Mahler",
"Lady Gaga",
"ReOrder",
"Halsey",
"Aly & Fila")) %>%
summarize(hours = sum(minutes) / 60) %>%
drop_na() %>%
ggplot(aes(x = date, y = hours, group = master_metadata_album_artist_name,
colour = master_metadata_album_artist_name)) +
geom_line() +
scale_x_date(breaks = "1 year", date_labels = "%Y") +
gghighlight(any(master_metadata_album_artist_name %in% c("Armin van Buuren",
"Gustav Mahler",
"Lady Gaga",
"Halsey"))) +
labs(title = "Hours Listened to Most Listened Artists Over Time",
subtitle = "Data for top 10 artists only \nData aggregated by month. Other grey lines are top Trance artists",
x = "Date",
y = "Hours Listened") +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank())
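The highlight condition has to say, for each line, whether that artist should stay coloured. The `||` operator is meant for single values only, and recent versions of R raise an error when it is given longer vectors, so a vectorized test is the safer choice. A minimal base R sketch of the difference between a per-row test and a per-line test (the `one_line` vector is an illustrative stand-in for the rows of one artist):

```r
# Artist names for the rows that make up one line in the plot (illustrative)
one_line <- c("Halsey", "Halsey", "Halsey")
highlighted <- c("Armin van Buuren", "Gustav Mahler", "Lady Gaga", "Halsey")

# %in% is vectorized: one TRUE/FALSE per row
per_row <- one_line %in% highlighted

# any() collapses those rows into a single TRUE/FALSE for the whole line
per_line <- any(per_row)

# The element-wise OR operator `|` gives the same per-row answer
same <- identical(per_row, one_line == "Halsey" | one_line == "Lady Gaga")
```

Because the artist name is constant within a line, `any()` and `all()` would give the same answer here.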
Listening Device
device_time <- spotify %>%
mutate(device = case_when(str_detect(platform, 'Android') ~ 'Android',
str_detect(platform, 'iOS') ~ 'iPhone',
str_detect(platform, 'Windows|windows|web') ~ 'Windows',
str_detect(platform, 'google') ~ 'Google Home'
)) %>%
group_by(device) %>%
summarise_at(vars(minutes), sum) %>%
mutate(hours = minutes/60) %>%
mutate(days = hours/24)
device_time %>%
bind_rows(
device_time %>%
summarise(across(minutes:days, sum)) %>% # Column totals
mutate(device = "Total")
) %>%
gt() %>%
tab_header(title = "Time Spent Listening by Platform") %>%
cols_label(
device = "Platform",
minutes = "Minutes",
hours = "Hours",
days = "Days") %>%
fmt_number(columns = c(minutes, hours, days),
rows = everything(),
decimals = 0) %>%
as_raw_html()
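The device labels come from matching substrings in the `platform` column, with `case_when()` using the first condition that matches. The sketch below runs the same rules on a few made-up platform strings. These strings are hypothetical and only resemble what the export contains.

```r
library(dplyr)
library(stringr)

# Hypothetical platform strings (not copied from the export)
platform <- c(
  "Android OS 11 API 30",
  "iOS 14.4 (iPhone11,8)",
  "Windows 10 (10.0.19042; x64)",
  "web_player windows 10;chrome",
  "Partner google cast_tv",
  "OS X 10.15.7 [x86 8]"
)

# Same rules as above; the first matching condition wins
device <- case_when(
  str_detect(platform, "Android") ~ "Android",
  str_detect(platform, "iOS") ~ "iPhone",
  str_detect(platform, "Windows|windows|web") ~ "Windows",
  str_detect(platform, "google") ~ "Google Home"
)

device
```

The last string matches none of the rules and becomes `NA`, which is why an `NA` row can show up in the summary table if your export contains platforms outside these four groups.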
Track Ends
Spotify provides data that indicates how each track was stopped. For example, if the user selects the play button, or the track completes and moves onto the next track, that information is recorded.
appload: Track stopped upon opening of the app.
backbtn: User pressed the back button.
clickrow: User selected another song within an album/playlist.
fwdbtn: User pressed the forward button.
persisted:
playbtn: User pressed the play button.
popup:
remote:
trackdone: The track completed in its entirety.
trackerror: An error caused the track to stop.
unknown: Unknown reason.
For the majority of tracks (72,814; 72.93%), the track ended because the song finished (trackdone). For 17,754 tracks (17.78%), the track ended because I clicked a new song via an album or playlist (clickrow). I use the forward button far more than the back button (3.72% of tracks were ended via the forward button). This is likely due to a song coming on that I wish to skip. Only 0.93% of tracks were ended via the back button, indicating I wished to repeat the previous song.
spotify %>%
mutate(reason_start = replace(reason_start, reason_start == "", "unknown")) %>% # Relabel before grouping; a grouping column cannot be modified
group_by(reason_start) %>%
dplyr::summarise(n = n()) %>%
mutate(percent = (n / sum(n))*100) %>%
gt() %>%
fmt_number(
columns = percent,
rows = everything(),
decimals = 2
) %>%
as_raw_html()
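The percentages quoted above follow directly from the counts and the 99,835 rows reported by `glimpse()`. A quick check in base R, assuming the percentages were computed over all rows (object names are illustrative):

```r
total_rows <- 99835                               # row count reported by glimpse()
counts <- c(trackdone = 72814, clickrow = 17754)  # counts quoted in the text

# Share of all rows, in percent, rounded to two decimals
pct <- round(100 * counts / total_rows, 2)
pct
```

This reproduces the 72.93% and 17.78% figures.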
I wanted to see if I preferentially used the back button at certain points (e.g., within an album or playlist) to replay songs. As seen below, 911 and Chromatica I by Lady Gaga are the tracks I most often ended via the back button.
spotify %>%
filter(reason_end == "backbtn") %>%
group_by(master_metadata_track_name) %>%
dplyr::summarise(n = n()) %>%
drop_na() %>%
arrange(desc(n)) %>%
head(10) %>%
gt() %>%
as_raw_html()
Next, I wanted to see how the way a track ends relates to how long it was played. As seen below, most tracks that end within the first few seconds are ended by the stop button, forward button, or back button. After a certain threshold (~30 s), the majority of tracks are played to completion. This suggests that if I dislike a song, I typically stop it within the first 30 seconds.
spotify %>%
mutate(reason_end = replace(reason_end, reason_end == "unknown", "Unknown"),
reason_end = replace(reason_end, reason_end == "", "Unknown"),
reason_end = replace(reason_end, reason_end == "unexpected-exit", "Misc"),
reason_end = replace(reason_end, reason_end == "unexpected-exit-while-paused", "Misc"),
reason_end = replace(reason_end, reason_end == "remote", "Misc"),
reason_end = replace(reason_end, reason_end == "trackerror", "Misc"),
reason_end = replace(reason_end, reason_end == "logout", "Misc"),
reason_end = replace(reason_end, reason_end == "popup", "Misc"),
reason_end = replace(reason_end, reason_end == "appload", "App Load"),
reason_end = replace(reason_end, reason_end == "backbtn", "Back Button"),
reason_end = replace(reason_end, reason_end == "fwdbtn", "Forward Button"),
reason_end = replace(reason_end, reason_end == "endplay", "Stop Button"),
reason_end = replace(reason_end, reason_end == "trackdone", "Track Done"),
reason_end = replace(reason_end, reason_end == "clickrow", "New Song"),
) %>%
ggplot(aes(x = minutes)) +
geom_histogram(aes(fill = reason_end), bins = 150) +
scale_x_continuous(limits = c(0,10), expand = c(0,0)) +
scale_y_continuous(expand = c(0,0)) +
scale_fill_brewer("Track End \nReason", palette = "Paired")+
theme_bw() +
labs(x = "Minutes Played",
y = "Track End Instances",
title = "Reason for Spotify Track Ends - Personal Data",
subtitle = "For songs played for 10 minutes or less",
caption = "Viz: Cole Baril - colebaril.ca | Data: Spotify | Software: R") +
theme(title = element_text(size = 16, face = "bold"),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 1))
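The long chain of `replace()` calls maps each raw `reason_end` code to a display label. One possible alternative, sketched here in base R, is a single named lookup vector. `reason_labels` and `relabel_reason()` are hypothetical names; the code-to-label pairs are the ones used in the plot above.

```r
# Raw code -> display label, mirroring the replace() calls above
reason_labels <- c(
  unknown = "Unknown", appload = "App Load", backbtn = "Back Button",
  fwdbtn = "Forward Button", endplay = "Stop Button", trackdone = "Track Done",
  clickrow = "New Song", remote = "Misc", trackerror = "Misc", logout = "Misc",
  popup = "Misc", "unexpected-exit" = "Misc", "unexpected-exit-while-paused" = "Misc"
)

# Hypothetical helper: relabel known codes and leave anything else untouched
relabel_reason <- function(x) {
  x[!is.na(x) & x == ""] <- "unknown"        # empty strings count as unknown
  hit <- x %in% names(reason_labels)         # codes that have a label
  x[hit] <- unname(reason_labels[x[hit]])
  x
}

relabel_reason(c("trackdone", "", "logout", "something-new"))
```

Codes missing from the lookup pass through unchanged, so a new reason added by Spotify would still appear in the plot under its raw name.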
References
library(purrr)
c("tidyverse", "jsonlite", "knitr", "gghighlight",
"hms", "lubridate", "here", "gt", "webshot") %>%
map(citation) %>%
print(style = "text")
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Ooms J (2014). “The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [stat.CO]. https://arxiv.org/abs/1403.2805.
Xie Y (2022). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.40, https://yihui.org/knitr/.
Xie Y (2015). Dynamic Documents with R and knitr, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, https://yihui.org/knitr/.
Xie Y (2014). “knitr: A Comprehensive Tool for Reproducible Research in R.” In Stodden V, Leisch F, Peng RD (eds.), Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595, http://www.crcpress.com/product/isbn/9781466561595.
Yutani H (2022). gghighlight: Highlight Lines and Points in ‘ggplot2’. R package version 0.4.0, https://CRAN.R-project.org/package=gghighlight.
Müller K (2022). hms: Pretty Time of Day. R package version 1.1.2, https://CRAN.R-project.org/package=hms.
Grolemund G, Wickham H (2011). “Dates and Times Made Easy with lubridate.” Journal of Statistical Software, 40(3), 1-25. https://www.jstatsoft.org/v40/i03/.
Müller K (2020). here: A Simpler Way to Find Your Files. R package version 1.0.1, https://CRAN.R-project.org/package=here.
Iannone R, Cheng J, Schloerke B, Hughes E (2022). gt: Easily Create Presentation-Ready Display Tables. R package version 0.7.0, https://CRAN.R-project.org/package=gt.
Chang W (2022). webshot: Take Screenshots of Web Pages. R package version 0.5.4, https://CRAN.R-project.org/package=webshot.