I used Spotify’s API, paired with my own web scraping code, to download users’ playlist metadata, and then used Spark to find patterns in it.
Web Scraping
This is a multi-step process that starts with supplying the Spotify IDs of the users you are interested in, along with some limits on how much music metadata to download. I start by downloading all of the users’ Playlists, from which I get information on every Song. I use that information to find each Album and Artist related to these Songs, as well as the Genres associated with those Artists and Albums. I parse this information into my own DataObjects, while handling dropped API calls, Null Songs, and Albums that contain only Null Songs.
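A minimal sketch of how the dropped-call and Null handling could look, written in Python for illustration. The names `call_with_retry` and `clean_albums`, the retry count, and the dictionary layout are all assumptions made for this example and are not taken from the project's code.

```python
import time
from typing import Callable, Optional


def call_with_retry(fetch: Callable[[], dict], attempts: int = 3,
                    delay: float = 0.5) -> Optional[dict]:
    """Call fetch(); retry on ConnectionError, return None if all attempts fail."""
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            # Hypothetical policy: wait a little longer after each failure,
            # but do not sleep once the final attempt has failed.
            if attempt < attempts - 1:
                time.sleep(delay * (attempt + 1))
    return None


def clean_albums(albums: list[dict]) -> list[dict]:
    """Drop Null tracks, then drop any album left with no tracks at all."""
    cleaned = []
    for album in albums:
        tracks = [t for t in album.get("tracks", []) if t is not None]
        if tracks:
            cleaned.append({**album, "tracks": tracks})
    return cleaned


# Illustrative input only: album "B" contains nothing but Null tracks.
raw = [
    {"name": "A", "tracks": [{"name": "t1"}, None]},
    {"name": "B", "tracks": [None, None]},
]
print([album["name"] for album in clean_albums(raw)])
```

Filtering tracks first and albums second means an Album made up entirely of Null Songs disappears without needing a separate check.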
I then write everything out in a SQL format, along with a custom manifest that details what information is inside, how many Songs and Albums there are, and a calculated Hash that summarizes how I output the information so that I can read it back correctly. The result is a data Archive containing the data for a particular set of Users.
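One way such a manifest and Hash could work, as a sketch in Python. The hashing scheme (SHA-256 over a canonical JSON description of the tables and columns), the function names, and the column names are assumptions for illustration; the actual format of the manifest is not described here. The counts and user names are the ones from the example run in this post.

```python
import hashlib
import json


def schema_hash(schema: dict[str, list[str]]) -> str:
    """SHA-256 over a canonical JSON encoding of the table/column layout."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(users: list[str], counts: dict[str, int],
                   schema: dict[str, list[str]]) -> dict:
    """Describe what the archive holds and how it was laid out."""
    return {
        "users": list(users),
        "counts": dict(counts),
        "schema_hash": schema_hash(schema),
    }


def can_read(manifest: dict, reader_schema: dict[str, list[str]]) -> bool:
    """A reader should only load the archive if its layout matches the writer's."""
    return manifest["schema_hash"] == schema_hash(reader_schema)


# Hypothetical column layout; only the counts and users come from the run above.
schema = {"tracks": ["id", "name", "album_id"], "albums": ["id", "name"]}
manifest = build_manifest(
    users=["doctorsalt", "1249049206"],
    counts={"playlists": 12, "artists": 187, "albums": 176,
            "tracks": 2024, "genres": 372},
    schema=schema,
)
print(can_read(manifest, schema))
```

Because column order is part of the hashed description, a reader that expects the columns in a different order is rejected instead of silently misreading the file.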
Here’s an example of my output while scraping:
Starting Collection
Getting genre seeds
Getting playlists from 2 users
There are 199 playlists before filtering
Retrieving 0/199 playlists
Retrieving 10/199 playlists
Retrieving 20/199 playlists
Retrieving 30/199 playlists
Retrieving 40/199 playlists
Retrieving 50/199 playlists
Retrieving 60/199 playlists
Retrieving 70/199 playlists
Retrieving 80/199 playlists
Retrieving 90/199 playlists
Retrieving 100/199 playlists
Retrieving 110/199 playlists
Retrieving 120/199 playlists
Retrieving 130/199 playlists
Retrieving 140/199 playlists
Retrieving 150/199 playlists
Retrieving 160/199 playlists
Retrieving 170/199 playlists
Retrieving 180/199 playlists
Retrieving 190/199 playlists
Getting tracks from 12 playlists...
Retrieving Albums...
Retrieving 0/176 albums
Retrieving 20/176 albums
Retrieving 40/176 albums
Retrieving 60/176 albums
Retrieving 80/176 albums
Retrieving 100/176 albums
Retrieving 120/176 albums
Retrieving 140/176 albums
Retrieving 160/176 albums
Getting approximately 531 tracks
Retrieving tracks from Albums 0/176
Retrieving tracks from Albums 50/176
Retrieving tracks from Albums 100/176
Retrieving tracks from Albums 150/176
Getting 187 artists
Retrieving Artists 0/187
Retrieving Artists 20/187
Retrieving Artists 40/187
Retrieving Artists 60/187
Retrieving Artists 80/187
Retrieving Artists 100/187
Retrieving Artists 120/187
Retrieving Artists 140/187
Retrieving Artists 160/187
Retrieving Artists 180/187
Web scraping complete!
----------------------
Playlists: 12
Artists: 187
Albums: 176
Tracks: 2024
Genres: 372
Collection took 116.3168971 seconds
Writing music data for doctorsalt, 1249049206
Writing to file took 0.3326463 seconds
Analysis
Each Archive can be loaded separately and have analyses run on it. Archives are accessed through an interactive terminal where Admins can load and modify Archives, and Users can log in to query their data and find new patterns using Spark SQL. A User can ask questions such as which Genres they listen to more than other Users do, and vice versa, or which User listens to the most profane Albums. I also correlate Users with average Album length to find patterns such as who listens to more Classical music and Film scores versus short Pop songs.
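As a rough illustration of the average Album length question, here is the kind of query involved. To keep the sketch runnable without a Spark installation it uses Python's built-in sqlite3 as a stand-in for Spark SQL. The query itself is ordinary SQL that Spark SQL should also accept, for example through `spark.sql(...)` once the tables are registered as temporary views. The table names, column names, and sample rows are invented for this example and are not real results.

```python
import sqlite3

# Plain SQL: average album length per user, longest first.
AVG_ALBUM_LENGTH_SQL = """
SELECT p.user_id, AVG(a.length_minutes) AS avg_album_minutes
FROM playlist_albums AS p
JOIN albums AS a ON a.album_id = p.album_id
GROUP BY p.user_id
ORDER BY avg_album_minutes DESC
"""


def average_album_length(albums, playlist_albums):
    """Load rows into an in-memory database and run the query above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE albums (album_id TEXT, length_minutes REAL)")
    conn.execute("CREATE TABLE playlist_albums (user_id TEXT, album_id TEXT)")
    conn.executemany("INSERT INTO albums VALUES (?, ?)", albums)
    conn.executemany("INSERT INTO playlist_albums VALUES (?, ?)", playlist_albums)
    rows = conn.execute(AVG_ALBUM_LENGTH_SQL).fetchall()
    conn.close()
    return rows


# Made-up sample data: one long film score, two short pop albums.
sample_albums = [("film_score", 70.0), ("pop_a", 30.0), ("pop_b", 40.0)]
sample_links = [("user_1", "film_score"), ("user_2", "pop_a"), ("user_2", "pop_b")]

for user_id, minutes in average_album_length(sample_albums, sample_links):
    print(user_id, minutes)
```

A User whose average is far above the others is likely listening to long-form Albums such as Film scores, which is the pattern this correlation is meant to surface.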