A topic model is a type of algorithm that scans a set of documents (known in the field of Natural Language Processing/NLP as a corpus), examines how words and phrases co-occur in them, and automatically “learns” groups or clusters of words that best characterize those documents. These sets of words often appear together to represent a coherent theme or topic.
At every major conference, new machine learning models are always being proposed. A fundamental question to ask is whether our current model (LDA) works better than new models like BERTopicoption. The only way to find an answer is to test these models and to study their results. So, that’s what we did.
In our previous blog “LDA and the method of topic modeling,” we described how we apply the LDA machine learning model to Metabob. In this blog post we compare the results with a new topic modeling technique called “BERTopic.”
BERTopic is a topic modeling technique that leverages a pre-trained transformer model to embed all the input documents in a multi-dimensional space. Imagine a 3-dimensional filing cabinet and you place each document at a location in space such that similar documents are near each other (that would be a 3-dimensional space). Then it finds the clusters by using HDBSCAN and an algorithm called c-TF-IDF to find the topics.
Also, BERTopic supports guided, (semi-) supervised, and dynamic topic modeling as well as visualizations that are similar to the LDA model.
Unlike LDA, which starts with a model that is not initialized, BERTopic starts with a model that is already trained. Think of studying a foreign language at age 20: you already know your native language well, so it should not take 20 more years to learn how to communicate in a new language. Bertopic starts with a model that has been already trained with a huge amount of data (for example Wikipedia) that need not be exactly the same as in your application (for example a local newspaper).
First, we tested the BERTopic model with three separate datasets.The first two are relatively common ones: 20_news_groups and trump_tweets. Since many previous models were tested on these two datasets mentioned, one of the advantages of choosing them for testing is that there are many modeling baseline records available for comparison. Thus, we can compare our BERTopic model with past models to see its performance. Next, we used our GitHub comments dataset to test BERTopic to see how topics within those comments are distributed in a vector space.
Second, we started BERTopic with an embedding model pre-trained on a large diverse dataset of over 1 billion sentence pairs: all-MiniLM-L6-v2.
In order to evaluate fairly, the testing methods for both models need to be synchronized. Thus, we need to control both the data pre-processing process and the selection of pre-trained embedding models. all-MiniLM-L6-v2 is selected here for its balanced computation time and performance.
Finally, we compare BERTopic to other Topic Models by using the OCTIS framework which calculates several different metrics including Coherence (such as c_v and c_npmi), Diversity, and Kullback-Leibler Divergence scores (think of how the scikit-learn framework facilitates cross-validations and accuracy numbers like F1-score). OCTIS also standardizes some parts of the training such as tokenizing and other preprocessing of our raw data.
We found that the BERTopic model outperforms the LDA according to the metrics in several categories, and has many advantages over the LDA model:
We will continue to explore BERTopic for our purposes and will keep testing other topic models including Top2Vec. Stay tuned for more findings and comparisons in our future posts!