Case Study: Popular Artists Peak Ages using Spotify API and MusicBrainz


Authors

Kyr Nastahunin
Malika Yelyubayeva
Mykyta Paroviy
Nicholas Rachfal
Yelizaveta Semikina

Description of the project

Links:

Spotify API (https://developer.spotify.com/documentation/web-api/)
MusicBrainz API (https://python-musicbrainzngs.readthedocs.io/en/v0.7.1/api/)

Technology used:

  • Python 3 and its libraries
  • Jupyter Notebook

Importing Python libraries

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline
import sklearn
import time
import nltk
import final_report
import ml_age

Data wrangling

Read API key and create an instance of Spotify API wrapper

To save time on data cleaning, we acquired our data through Spotify’s API.

client_id, client_secret = final_report.read_api_key()
spotify = final_report.initiate_spotify_api(client_id, client_secret)

Get the U.S. Top 100 playlists from 2010 to 2021

Using the API, we can now retrieve the Top 100 songs of each year by passing the playlist IDs as parameters to the Spotify Playlist endpoint. The response contains a lot of information that is irrelevant to our research, so we dropped the unnecessary columns and added columns of our own that specify the year in which the song charted and its chart position. As a result, we get a dataframe of the most popular artists and their most popular songs from the last 12 years, containing 1150 songs (2020 only had a top 50 chart).

playlist_ids = ['37i9dQZF1DXc6IFF23C9jj', '37i9dQZF1DXcagnSNtrGuJ', '37i9dQZF1DX0yEZaMOXna3', '37i9dQZF1DX3Sp0P28SIer',
'37i9dQZF1DX0h0QnLkMBl4', '37i9dQZF1DX9ukdrXQLJGZ', '37i9dQZF1DX8XZ6AUo9R4R', '37i9dQZF1DWTE7dVUebpUW',
'37i9dQZF1DXe2bobNYDtW8', '37i9dQZF1DWVRSukIED0e9', '37i9dQZF1DX7Jl5KP2eZaS', '37i9dQZF1DX7EqpAEG8F4f']
top_charts = final_report.get_top_playlists(playlist_ids, spotify)
print("Total songs: " + str(len(top_charts.index)))
top_charts.head()
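The helper final_report.get_top_playlists is not shown in this post. As a rough sketch of the flattening step described above, assuming each response follows the shape of Spotify’s playlist-items payload (an `items` list whose entries hold a `track` object), the rows could be built as follows. The function name `playlist_to_rows`, the `chart_position` column name and the sample data are hypothetical.

```python
def playlist_to_rows(playlist_items, chart_year):
    """Flatten one playlist response into rows with year and chart position.

    Assumes `playlist_items` looks like Spotify's playlist-items payload:
    {"items": [{"track": {"id": ..., "name": ..., "popularity": ...,
                          "artists": [{"name": ...}, ...]}}, ...]}
    """
    rows = []
    for position, item in enumerate(playlist_items["items"], start=1):
        track = item.get("track")
        if track is None:  # unavailable tracks can come back as null
            continue
        rows.append({
            "id": track["id"],
            "name": track["name"],
            "artists": [artist["name"] for artist in track["artists"]],
            "popularity": track["popularity"],
            "chart_year": chart_year,
            "chart_position": position,
        })
    return rows


# Hypothetical two-track response, for illustration only.
sample = {
    "items": [
        {"track": {"id": "a1", "name": "Song A", "popularity": 80,
                   "artists": [{"name": "Artist X"}]}},
        {"track": {"id": "b2", "name": "Song B", "popularity": 75,
                   "artists": [{"name": "Artist Y"}, {"name": "Artist Z"}]}},
    ]
}
rows = playlist_to_rows(sample, chart_year=2010)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, which keeps the `artists` column as a list per track.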

Getting artists’ DoB and approximate age: additional data from MusicBrainz

Now that we have the names of the most popular artists, we need to find their dates of birth. We discovered a Python library called ‘musicbrainzngs’, which lets us retrieve all kinds of music metadata from the MusicBrainz database. We obtain the artists’ birth dates by extracting the artist names from our Spotify dataframe, dropping any duplicates, and querying those names with the musicbrainzngs ‘search_artists’ function, which returns a dictionary that contains the artist’s birth date. After extracting those birth dates, we calculated each artist’s current age and added it to the dataframe. However, a group has no single birth date, so we could not calculate an age for artists that are part of a group and decided to drop all the groups.

artists = final_report.get_unique_artists(top_charts)
artists = final_report.get_dobs(artists) # no dob for bands
artists = artists.dropna()
artists['age'] = artists['dob'].apply(final_report.calculate_age)
artists.head()
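The helpers final_report.get_dobs and final_report.calculate_age are not shown here. The sketch below illustrates one way the two steps could work, assuming a search result shaped like the dictionary musicbrainzngs returns (an `artist-list` whose entries carry a `type` and a `life-span`). Taking the first matching person, padding partial dates, and the names `extract_birth_date` and `sample_result` are all assumptions of this sketch, not the authors’ implementation.

```python
from datetime import date


def extract_birth_date(search_result):
    """Return the 'begin' date of the first person in a search result, or None.

    Assumes the shape returned by a musicbrainzngs artist search:
    {"artist-list": [{"name": ..., "type": "Person" or "Group",
                      "life-span": {"begin": "YYYY-MM-DD"}}, ...]}
    """
    for artist in search_result.get("artist-list", []):
        if artist.get("type") != "Person":
            continue  # for groups, 'begin' is a founding date, not a birth date
        return artist.get("life-span", {}).get("begin")
    return None


def calculate_age(dob, today=None):
    """Whole years between a date string and `today`.

    MusicBrainz dates can be partial ("1990" or "1990-06"); missing parts
    default to 1 here, which is an assumption of this sketch.
    """
    today = today or date.today()
    parts = [int(p) for p in dob.split("-")]
    year, month, day = (parts + [1, 1])[:3]
    # Subtract one year if this year's birthday has not happened yet.
    before_birthday = (today.month, today.day) < (month, day)
    return today.year - year - before_birthday


# Hypothetical search result, for illustration only.
sample_result = {
    "artist-list": [
        {"name": "Example Band", "type": "Group", "life-span": {"begin": "2005"}},
        {"name": "Example Singer", "type": "Person",
         "life-span": {"begin": "1990-06-15"}},
    ]
}
dob = extract_birth_date(sample_result)
age = calculate_age(dob, today=date(2021, 6, 14))
```

Passing `today` explicitly keeps the calculation reproducible; without it, the ages change whenever the notebook is rerun.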

Exploratory data analysis (EDA)

Spotify API data

Our data comes from two sources, which we combine when needed. The majority of the data comes from Spotify. After dropping the unnecessary columns, we get the following data:

top_charts.info()

MusicBrainz API data

artists.info()
artists.drop(artists[artists.age < 15].index, inplace=True)
print("The average age in the charts is " + str(artists['age'].mean()))
print("The max age in the charts is " + str(artists['age'].max()) + " and the minimum age is " + str(artists['age'].min()))
print("The standard deviation is " + str(artists['age'].std()) + " and variance is " + str(artists['age'].var()))

The most popular artists of the last 12 years by number of appearances in charts

As part of our exploratory data analysis, we became interested in which artists featured in the charts the most over the last 12 years. We aggregated our top_charts dataframe by artist and plotted the number of appearances. As expected, there aren’t many young artists here, since their careers have been shorter than those of older artists. In fact, most of these artists are in their 30s now.

charts = top_charts.copy()
charts = charts.explode('artists')
final_report.plot_most_frequented_with_age(charts, artists)
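The `explode('artists')` call above gives every credited artist on a track its own row, so a collaboration counts once for each artist. A minimal sketch of the same counting idea in plain Python, using hypothetical track data (the plotting helper itself is not shown in this post):

```python
from collections import Counter


def count_appearances(tracks):
    """Count chart appearances per artist; every credited artist gets one."""
    counts = Counter()
    for track in tracks:
        counts.update(track["artists"])
    return counts


# Hypothetical chart rows, for illustration only.
tracks = [
    {"name": "Song A", "artists": ["Artist X"]},
    {"name": "Song B", "artists": ["Artist X", "Artist Y"]},
    {"name": "Song C", "artists": ["Artist Y"]},
    {"name": "Song D", "artists": ["Artist X"]},
]
top = count_appearances(tracks).most_common(2)
```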

Number of songs released by age groups over the span of 12 years

To get a better understanding of when music stars reach the end of their careers, we took our top_charts dataframe and added two attributes, ‘age group’ and ‘release date’, to aggregate artists by their respective age groups and the year they released their track. We defined our age groups in five-year bands, with 20–24 being the early twenties, 25–29 the late twenties, and so on. We can see that by the late thirties only a small number of songs are released and featured in the top charts, as these artists either stopped making songs or their songs no longer reached the top charts. So in many cases, music stars reach the end of their careers after the age of 35.

charts = top_charts.copy()
final_report.plot_age_groups(charts, artists)
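The five-year bands described above can be sketched as a small mapping function. The labels for ages under 20 and from 60 up, and the use of a plain year difference as the age at release, are assumptions of this sketch; the post only defines the twenties explicitly.

```python
from collections import Counter

DECADES = {2: "Twenties", 3: "Thirties", 4: "Forties", 5: "Fifties"}


def age_group(age):
    """Map an age to a five-year band such as 'Early Twenties'."""
    if age < 20:
        return "Under 20"      # hypothetical label
    if age >= 60:
        return "60 and over"   # hypothetical label
    half = "Early" if age % 10 < 5 else "Late"
    return f"{half} {DECADES[age // 10]}"


# Hypothetical (birth_year, release_year) pairs, for illustration only.
releases = [(1990, 2012), (1990, 2016), (1985, 2021), (1994, 2019)]

# Age at release is approximated by the year difference (month and day ignored).
groups = Counter(age_group(release - birth) for birth, release in releases)
```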

Ages of artists when appearing in the top charts

Furthermore, to help determine the artists’ peak ages, we looked at all the songs that were released in the last 12 years and plotted the ages at which their artists released them. We can see that the most common age is 26, with the counts spread roughly symmetrically around that age. However, there are still more songs released after the age of 26 than before.

charts = top_charts.copy()
final_report.plot_chart_aged(charts, artists)

Peak ages

Now let’s look at the peak ages of the artists. First, let’s keep only the artists who were featured in the charts at least 3 times, so that we do not analyze artists with too little data. Then we find each artist’s most popular track of all time and drop the rest.

# Load the tracks and keep only chart years before 2021
df = pd.read_csv('tracks.csv')
until_2021 = df[df['chart_year'] < 2021]

# Count chart appearances per artist and keep artists with at least 3
names = until_2021[['artist_name']]
names = names.pivot_table(columns=['artist_name'], aggfunc='size').reset_index().rename(columns={0: 'count'})
names = names[names['count'] > 2]

# For each remaining artist, keep only the track(s) with the highest popularity
until_2021 = until_2021.merge(names, on='artist_name').drop(columns=['count'])
most_popular_songs = until_2021.groupby('artist_name')['popularity'].max().to_frame()
most_popular_songs = until_2021.merge(most_popular_songs,
    on=['artist_name', 'popularity']).drop_duplicates(subset=['artist_name', 'name'])
# Mean age at which these most popular tracks were released
most_popular_songs['release_age'].mean()

Plot the peak ages

Now let’s see the distribution of age and popularity, considering only the most popular song for each artist.

# make 2d plot of popularity and release age
figure(figsize=(5, 5), dpi=80)
fig = plt.scatter(x=most_popular_songs['release_age'], y=most_popular_songs['popularity'])
plt.xlabel("Age", labelpad=14)
plt.ylabel("Song popularity", labelpad=14)
plt.title("There's still a significant number of artists that made their most popular song after 30")

Machine Learning

Classifier: age group of an artist at the year of release

Using Spotify’s audio features for each track, we predict which artist age category the track belongs to. The categories are defined the same way as in the age-group graph, e.g. Early Twenties, Late Twenties, …

ml_age.run_ml()

PCA, LRModel, MLRModel, LRModel with 2 best features from results of the PCA above.

Predict the popularity using audio features, collab feature and genre

import ml_pca

projectionsDf, y_data = ml_pca.run()
ml_pca.additional(projectionsDf, y_data)

Hypothesis testing

Now let’s perform a hypothesis test to confirm at what age artists actually peak. Our team voted and decided that the peak age for artists would probably be around 27. Let’s test it!

# one-sample t-test of our hypothesis against critical t-values
# one-tailed alpha values with 120 degrees of freedom
significance = {
    0.05: 1.658,
    0.025: 1.980
}
hypothesis_mean = 27.0
sample_mean = most_popular_songs['release_age'].mean()
standard_deviation = most_popular_songs['release_age'].std()
sample_size = most_popular_songs.shape[0]
print("Sample size: " + str(sample_size))
t = (sample_mean-hypothesis_mean)/(standard_deviation/np.sqrt(sample_size))
print("T-value: " + str(t))
print("At significance-level 0.05 the null hypothesis is " + ("not " if t < significance[0.05] else "") + "rejected")
print("At significance-level 0.025 the null hypothesis is " + ("not " if t < significance[0.025] else "") + "rejected")
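The statistic computed above is the one-sample t-value, t = (sample mean - hypothesised mean) / (s / sqrt(n)), compared against a one-tailed critical value. A self-contained sketch with hypothetical ages follows; note that the critical value depends on the degrees of freedom (n - 1), so the table row used should match the actual sample size.

```python
import math
import statistics


def one_sample_t(sample, hypothesis_mean):
    """t = (sample mean - hypothesised mean) / (s / sqrt(n)), with sample std s."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # n - 1 in the denominator, like pandas .std()
    return (mean - hypothesis_mean) / (s / math.sqrt(n))


# Hypothetical release ages, for illustration only.
ages = [26, 28, 30, 24, 32]
t = one_sample_t(ages, hypothesis_mean=27.0)

# One-tailed critical value at alpha = 0.05 for n - 1 = 4 degrees of freedom.
critical = 2.132
reject = t > critical
```

With only five hypothetical values the t-value stays below the critical value, so the null hypothesis of a mean age of 27 would not be rejected in this toy case.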

Conclusion

We were able to disprove the statistics given by the Washington Post article. We learned that the peak age for music artists is about 26–29 years old. However, there is still a significant number of artists that peak earlier or later. We also learned that the careers of musical artists rarely last beyond 35. The majority of them either stop making new songs, or their songs rarely reach the top charts. Current starlets might use this information to plan their careers and see what they can expect in the future.
