The Book of Proverbs is packed full of wisdom. If you've given it a read, you'll have noticed it's not exactly a page-turner in the typical sense. While the text has distinct sections, such as an Introduction, Proverbs by Solomon, the Sayings of the Wise, and more, the themes and teachings can change from verse to verse or chapter to chapter.
Inspired by a graphical representation of related Wikipedia topics, I wondered what neighbourhoods exist within Proverbs. So I began this project with the goal of finding a visual representation of the different themes in Proverbs. Luckily, I had just finished some projects on clustering techniques, which gave me enough confidence to give it a shot.
Proverbs is part of the Bible's Wisdom literature, and its authorship is generally attributed to King Solomon. A proverb is a short, sometimes formulaic, saying that conveys some truth through experience or common sense. This book contains a collection of these sayings in a not-so-obvious order. A common pattern used by Solomon involves introducing a character and their actions, and then describing the antithesis. For example, Chapter 12, Verse 24 says,
"The hand of the diligent will rule, while the slothful will be put to forced labor"
and in Chapter 12, Verse 19 it says,
"Truthful lips endure forever, but a lying tongue is but for a moment"
The translation of the Book of Proverbs used here is the English Standard Version (ESV). This data was easy to find online, is considered relatively word-for-word, and required little text cleaning. Some initial stats on this data set included:
Other than stripping stray whitespace, I didn't remove punctuation, since I would be using transformer models that were trained on text with these features and can reasonably handle punctuation. There are a few duplicate verses in Proverbs; I intentionally left them in the dataset, since a reasonable clustering algorithm should naturally place them in the same cluster, which makes for a useful validation mechanism.
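Here is a minimal sketch of that cleaning step, assuming the verses are held in a dictionary keyed by reference. The function names, the dictionary layout, and the `copy-of-12:19` entry are hypothetical and only there to illustrate the duplicate check.

```python
import re
from collections import defaultdict


def clean_verse(text):
    """Collapse runs of whitespace and trim the ends; punctuation is left untouched."""
    return re.sub(r"\s+", " ", text).strip()


def find_duplicates(verses):
    """Group references whose cleaned text is identical (kept, not removed)."""
    by_text = defaultdict(list)
    for ref, text in verses.items():
        by_text[clean_verse(text)].append(ref)
    return [refs for refs in by_text.values() if len(refs) > 1]


# Toy input: two verses quoted in this post plus an artificial repeat.
raw = {
    "12:19": "  Truthful lips endure forever,\n but a lying tongue is but for a moment ",
    "12:24": "The hand of the diligent will rule, while the slothful will be put to forced labor",
    "copy-of-12:19": "Truthful lips endure forever, but a lying tongue is but for a moment",
}

cleaned = {ref: clean_verse(text) for ref, text in raw.items()}
duplicates = find_duplicates(raw)
print(duplicates)
```

The duplicate groups can later be compared against the cluster labels: if two identical verses land in different clusters, something is off.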
The next step was to turn words into numbers. We do this because machine learning models don't understand words the way humans do. They need words to be transformed into long lists of numbers, called embeddings. In this case, each verse was turned into a list of 384 numbers (or dimensions).
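A rough sketch of the embedding step might look like the following. It assumes the `sentence-transformers` Python package and the model named in this post; the helper names, the unit-length normalization, and the choice to pass the encoder in as an argument are assumptions made for illustration, not necessarily how the project was written.

```python
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"


def load_encoder(model_name=MODEL_NAME):
    """Return a function that maps a list of strings to a list of vectors.

    The library is imported lazily so the rest of this sketch can be read
    and run without it installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # normalize_embeddings=True scales each vector to unit length
    # (an assumption of this sketch, convenient for cosine similarity).
    return lambda texts: model.encode(list(texts), normalize_embeddings=True).tolist()


def embed_verses(verses, encoder):
    """Embed every verse and check that all vectors share one dimensionality."""
    vectors = encoder(verses)
    dims = {len(v) for v in vectors}
    if len(vectors) != len(verses) or len(dims) != 1:
        raise ValueError("expected one fixed-length vector per verse")
    return vectors


# With the library installed, usage would be along these lines:
# vectors = embed_verses(list_of_verse_strings, load_encoder())
# len(vectors[0]) should then match the 384 dimensions mentioned above.
```

Passing the encoder as a parameter keeps the model download separate from the bookkeeping, so the shape check can be exercised with a stand-in function.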
One concern I had was whether the brevity of the verses would hurt the embedding model's ability to produce accurate embeddings. I presume a general pretrained embedding model is trained on paragraphs, which allow many more characters to express meaning. I ended up using the open-source embedding model sentence-transformers/all-MiniLM-L6-v2 because it was:
The video that inspired me used the Leiden algorithm to sort its network of nodes (articles) and edges (a measure of similarity) into clusters. This algorithm essentially finds groups of nodes that are more densely connected to each other than to the rest of the network; each such group forms a community. I constructed a graph by computing the cosine similarity between each pair of verses and connecting verses whose cosine similarity exceeded a certain threshold. Luckily, there are Python libraries that implement this algorithm, so the computation was pretty easy.
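The graph construction can be sketched in plain Python, with the community detection handed to a library. The post does not say which library or quality function was used, so the `igraph` and `leidenalg` packages, the modularity objective, the function names, and the 0.8 threshold on the toy data are all assumptions.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_edges(vectors, threshold):
    """Return (i, j, similarity) for every pair of verses above the threshold."""
    edges = []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine_similarity(vectors[i], vectors[j])
        if sim > threshold:
            edges.append((i, j, sim))
    return edges


def leiden_communities(n_nodes, edges, seed=0):
    """Run Leiden on the weighted graph; returns one community id per node.

    Assumes the igraph and leidenalg packages; imported lazily so the
    edge-building code above works without them.
    """
    import igraph as ig
    import leidenalg as la

    graph = ig.Graph(n=n_nodes, edges=[(i, j) for i, j, _ in edges])
    weights = [w for _, _, w in edges]
    partition = la.find_partition(
        graph, la.ModularityVertexPartition, weights=weights, seed=seed
    )
    return partition.membership


# Toy 2-D "embeddings": two pairs of nearly parallel vectors.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_edges = similarity_edges(toy_vectors, threshold=0.8)  # hypothetical threshold
print([(i, j) for i, j, _ in toy_edges])
```

The threshold controls how dense the graph is: lower values connect more verses and tend to produce fewer, larger communities, so it is worth trying a few values.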
The algorithm identified the following communities within Proverbs:
[Interactive graph visualization]
This graph is fun to explore. Community 1 seems to be focused on the Wicked vs. the Righteous and how each character behaves. Community 12 (with only 8 verses) features one of my favourite characters: the sluggard. However, when I was tuning this algorithm's parameters, I noticed I would get disproportionately large communities, such as Communities 1 and 2. This is a fun visualization, but let's move on to other ways of visualizing Proverbs' themes.
Another powerful way to visualize relationships between verses is through a dendrogram, a tree-like diagram that reveals how verses cluster together at different levels of similarity. Unlike the community graph above, which shows flat groupings, a dendrogram displays the hierarchical structure of how verses relate to each other.
In this visualization, verses that are very similar merge together at the bottom (low height on the y-axis), while more distantly related groups merge higher up. This bottom-up approach uses Ward's method with Euclidean distance to build the tree structure.
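To make the merge heights concrete, here is a naive, pure-Python sketch of Ward's method on toy 2-D points. It is written for readability rather than speed, and the function names are hypothetical. For real data, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would be the practical choice, though the post does not say which implementation was used.

```python
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_merges(points):
    """Merge clusters bottom-up; returns (id_a, id_b, height) per merge.

    Leaves get ids 0..n-1 and each merged cluster gets the next free id.
    Height is sqrt(2 * cost), which equals the plain Euclidean distance
    when two single points merge.
    """
    clusters = {i: [p] for i, p in enumerate(points)}
    next_id = len(points)
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        height = (2 * ward_cost(clusters[a], clusters[b])) ** 0.5
        merges.append((a, b, height))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Toy points: two tight pairs and one outlier.
toy_points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]]
toy_merges = ward_merges(toy_points)
for a, b, height in toy_merges:
    print(a, b, round(height, 2))
```

The tight pairs merge first at low heights, and the larger groups only join near the top of the tree, which is the shape the dendrogram draws.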
The dendrogram below shows all 915 verses from Proverbs. Each colored branch represents clusters of thematically similar verses, with colors assigned based on a similarity threshold. Click on any branch to collapse or expand it, hover over verse labels to see the full text, and use the controls to navigate the tree.
[Interactive dendrogram visualization]