
Hierarchical Clustering on Spotify Audio Features

Using Spotify audio features, we clustered years (1921–2020) to see whether music history naturally breaks into distinct “eras” based on how songs sound. Hierarchical clustering revealed three clear periods (Early, Mid, Modern) with noticeably different energy, acousticness, loudness, danceability, and popularity profiles.

This project used hierarchical clustering to test whether groups of years in Spotify data share similar audio characteristics, and what those groups suggest about how musical style has evolved.


The dataset contains 100 observations, one per year from 1921 to 2020, where each observation is an aggregated summary of audio features generated by the Spotify API. Features include acousticness, danceability, energy, instrumentalness, liveness, speechiness, valence, tempo, loudness, duration, and popularity, all on mixed scales.


Data transformation + methods

The data had no missing values. We standardized the numeric features to z-scores to make their scales comparable, and excluded “year” from clustering so it could be used for interpretation without influencing distances. We also log-transformed speechiness to reduce skew, and left the other variables unchanged because outliers had little influence on them.
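As a rough illustration of the preprocessing described above, here is a minimal sketch on synthetic data. The function name `preprocess`, the column names, and the use of `log1p` (rather than a plain log) are assumptions for the example, not the project's actual code. It also assumes the log transform is applied before standardizing, since z-scores can be negative. The "year" column is simply left out of the feature matrix.

```python
import numpy as np

def preprocess(features, names, log_cols=("speechiness",)):
    """Log-transform the skewed columns, then z-score every column."""
    X = np.asarray(features, dtype=float).copy()
    for col in log_cols:
        j = names.index(col)
        # log1p is an assumption for this sketch; it stays defined at 0
        X[:, j] = np.log1p(X[:, j])
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    return (X - mean) / std

# Synthetic stand-in for the 100 yearly aggregates (not the real data)
rng = np.random.default_rng(0)
names = ["acousticness", "energy", "speechiness", "loudness"]
raw = np.column_stack([
    rng.uniform(0, 1, 100),      # acousticness in [0, 1]
    rng.uniform(0, 1, 100),      # energy in [0, 1]
    rng.lognormal(-3, 1, 100),   # right-skewed, like speechiness
    rng.uniform(-20, -5, 100),   # loudness in dB
])
Z = preprocess(raw, names)
```

After this step every column of `Z` has mean 0 and standard deviation 1, so no single feature (loudness in dB, for example) dominates the Euclidean distances.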


For clustering, we chose hierarchical clustering because we didn’t know the “right” number of eras upfront, and the dendrogram lets you see both big eras and transitional sub-structures.


We used Euclidean distance with Ward’s linkage. To pick the number of clusters, we reviewed silhouette and elbow plots. The silhouette score peaked at k=2, but we selected k=3 because it gave a more interpretable “Early / Mid / Modern” structure.
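The clustering step could look something like the sketch below, which uses SciPy for Ward linkage and scikit-learn for silhouette scores. The data here is three synthetic blobs standing in for the standardized yearly features, and the variable names (`tree`, `scores`, `labels_k3`) are illustrative assumptions, so the scores it produces say nothing about the real dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

# Synthetic stand-in: three well-separated blobs of 30 points each
rng = np.random.default_rng(42)
centers = np.array([[-3.0, 0.0], [0.0, 3.0], [3.0, 0.0]])
Z = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Ward's linkage on Euclidean distances; each row of `tree` is one merge
tree = linkage(Z, method="ward", metric="euclidean")

# Cut the same tree at several values of k and score each cut
scores = {}
for k in range(2, 7):
    labels = fcluster(tree, t=k, criterion="maxclust")
    scores[k] = silhouette_score(Z, labels, metric="euclidean")

# Final cut at the chosen number of clusters
labels_k3 = fcluster(tree, t=3, criterion="maxclust")
```

Because the tree is built once and only the cut changes, comparing k=2 against k=3 is cheap. The merge heights in `tree[:, 2]` are what an elbow-style plot would typically be drawn from.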


Results and Conclusions

Two relationships helped explain why certain years grouped together: energy and loudness move strongly together, and acousticness tends to move opposite to energy.
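One simple way to check relationships like these is a Pearson correlation matrix. The sketch below builds synthetic series in which loudness rises with energy and acousticness falls with it, purely to show the mechanics. The coefficients and noise levels are made up for the example and are not estimates from the Spotify data.

```python
import numpy as np

# Synthetic yearly series (assumed shapes, not real values)
rng = np.random.default_rng(1)
energy = rng.uniform(0.2, 0.8, size=100)
loudness = -20 + 15 * energy + rng.normal(scale=1.0, size=100)
acousticness = 1 - energy + rng.normal(scale=0.05, size=100)

# Rows are variables, so corr[i, j] is the correlation between rows i and j
corr = np.corrcoef(np.vstack([energy, loudness, acousticness]))
energy_loudness = corr[0, 1]       # expected to be strongly positive
energy_acousticness = corr[0, 2]   # expected to be strongly negative
```

When two features are this tightly coupled they carry overlapping information, which is part of why years that differ on one tend to differ on the other and end up in different clusters.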


The clustering supported three eras that were also coherent by decade distribution and feature heatmaps:


  • Early (1920s–1930s): high acousticness and valence, and the lowest energy, tempo, and popularity.

  • Mid (1940s–1950s): moderate acousticness, with danceability and popularity higher than the early era but lower than modern.

  • Modern (1960s–present): lowest acousticness and highest energy, danceability, and popularity, with higher loudness and shorter durations.
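Era profiles like the ones above come from averaging each feature within a cluster. Here is a minimal sketch with a toy matrix of six "years" and two standardized features. The helper name `cluster_profiles` and the numbers are hypothetical.

```python
import numpy as np

def cluster_profiles(Z, labels):
    """Return {cluster label: mean feature vector} for the rows of Z."""
    Z = np.asarray(Z, dtype=float)
    labels = np.asarray(labels)
    return {int(k): Z[labels == k].mean(axis=0) for k in np.unique(labels)}

# Toy data: columns could stand for (acousticness, energy) z-scores
Z = np.array([
    [1.0, -1.0], [1.2, -0.8],   # cluster 1
    [0.0, 0.0], [0.2, -0.2],    # cluster 2
    [-1.0, 1.0], [-1.4, 1.0],   # cluster 3
])
labels = [1, 1, 2, 2, 3, 3]
profiles = cluster_profiles(Z, labels)
```

Since the features are z-scores, a positive mean reads as "above the all-years average" and a negative one as "below", which makes the clusters easy to compare in a heatmap.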

The results frame “old vs new music” as measurable audio shifts you can act on, such as segmentation for nostalgia-driven playlists, catalog marketing by sound profile, and better curation rules when you want a playlist that transitions smoothly across time.


My takeaways

This was my second ML project, so it was pretty basic, but definitely a cool glimpse into the kind of work that involves ML even if at first glance it feels like it wouldn’t be involved. It was also a good reminder that unsupervised learning can still tell a clean story when you set it up carefully. Hierarchical clustering was especially useful because the dendrogram gives the distinct groups a visual, tangible structure.


However, the cleaned dataset was only 100 yearly aggregates, which hid within-year variation, and Ward’s method is greedy, meaning early merge decisions can’t be undone. Cluster boundaries also shift if you change transformations or add features, and even the choice of k=3 had a subjective element. There is scope to improve the study by moving to song-level data and adding richer signals like genre and/or lyrics, then validating with alternative methods and domain knowledge.


You can check out the entire project on my GitHub here.
