Cluster analysis – grouping similar objects together – is frequently used in data analysis and visualization. More and more, I see various types of clustering used to find relevant documents, both for visualizing document sets (essentially providing a landscape for the user to explore) and for finding related documents (essentially “more like this”). But these clustering methods may not be giving you what you expect!
First, the good news. Modern artificial intelligence methods can direct you to relevant documents with much greater precision and recall than clustering. So if you are using an application like Qinsight™, you’re in good shape. But if you are stuck with generic document clustering, here is what you need to look out for.
There are many different types of clustering algorithms, including hierarchical clustering (which generates a tree structure), centroid-based clustering (for example, k-means), and even fuzzy methods in which one object may belong, to varying degrees, to more than one cluster. Whatever algorithm you use for clustering (or someone else used to generate the lovely visualization you are looking at), central to the approach is the selection of features (or parameters) on which to base the clustering. And therein lies the rub.
To explain features, consider the problem of organizing people into different groups. Without specific guidance, how would you do this? Do you use, say, height and weight? But what if you are interested in their food preferences? Or specific gene mutations? You can easily see that, unless you know enough about the goal or the questions that need answering, you simply cannot approach the problem generically and get a meaningful outcome.
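To make the point concrete, here is a minimal sketch in plain Python. The four people, their values, and the function names `kmeans` and `group_people` are all made up for illustration, and the k-means is deliberately simplified (it seeds the centroids with the first k points). The same algorithm run on the same people yields different groups depending only on which feature it is given.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def group_people(people, features, k=2):
    """Cluster people using only the named features; return groups of names."""
    names = sorted(people)
    points = [tuple(people[n][f] for f in features) for n in names]
    labels = kmeans(points, k)
    groups = {}
    for name, label in zip(names, labels):
        groups.setdefault(label, []).append(name)
    return sorted(groups.values())


# Hypothetical data: height in cm, and a 0-10 score for liking spicy food.
people = {
    "Ana": {"height": 150, "spicy": 9},
    "Ben": {"height": 152, "spicy": 1},
    "Cy": {"height": 190, "spicy": 8},
    "Di": {"height": 188, "spicy": 2},
}

print(group_people(people, ["height"]))  # [['Ana', 'Ben'], ['Cy', 'Di']]
print(group_people(people, ["spicy"]))   # [['Ana', 'Cy'], ['Ben', 'Di']]
```

Neither grouping is wrong; each answers a different question. If you combined features measured on different scales, you would also have to decide how to scale them, which is yet another choice that shapes the result.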
Determining relevant features is just as important when clustering text documents. One simplistic approach that I used extensively in the past (before enlightenment) was to determine the terms that were statistically topical. You can think about this simply as: “If the term is found in too high or too low a percentage of documents, then it won’t be useful for clustering.” More accurately, topicality is based on the degree of non-randomness in how a term is distributed across the documents. This topical approach can provide a nice starting point for general considerations, but it fails when the user’s interests are not brought to the forefront.
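The simple version of that rule of thumb can be sketched in a few lines of Python. This is only the percentage cutoff, not the non-randomness measure; the function name `topical_terms`, the thresholds, and the toy documents are assumptions chosen for illustration.

```python
from collections import Counter


def topical_terms(docs, min_frac=0.2, max_frac=0.8):
    """Return terms whose document frequency lies within [min_frac, max_frac]."""
    n_docs = len(docs)
    # Count each term once per document (document frequency, not raw count).
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(text.lower().split()))
    return sorted(
        term
        for term, count in doc_freq.items()
        if min_frac <= count / n_docs <= max_frac
    )


# Toy documents, invented for this example.
docs = [
    "the gene mutation alters the protein",
    "the protein binds the receptor",
    "the receptor signals the cell",
    "the cell expresses the gene",
    "the rare zebrafish study",
]

print(topical_terms(docs, min_frac=0.3, max_frac=0.8))
# ['cell', 'gene', 'protein', 'receptor']
```

“the” is dropped because it appears in every document, and words such as “zebrafish” are dropped because they appear in only one. Notice that nothing in this filter knows what the user is actually looking for.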
I have also seen clustering of biomedical text documents based on predefined concept lists such as MeSH headings or Gene Ontology categories. Citations have also been used in some cases (although one must be very careful not to consider all citations to be of equal value!). In some cases, each of these methods can provide a more focused view than the topicality method, but they still fail to directly address the user’s needs.
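As a rough sketch of the concept-list idea, each document can be reduced to a set of labels and compared by set overlap. Jaccard similarity is my choice here for illustration, not necessarily what any particular tool uses, and the document names and label assignments are invented. A clustering algorithm would then work from these pairwise similarities.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two label sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical documents tagged with concept labels (assignments are made up).
docs = {
    "doc1": {"Neoplasms", "Mutation", "Apoptosis"},
    "doc2": {"Neoplasms", "Mutation"},
    "doc3": {"Zebrafish", "Apoptosis"},
}

# Pairwise similarities that a clustering step could consume.
for left, right in combinations(sorted(docs), 2):
    print(left, right, round(jaccard(docs[left], docs[right]), 2))
```

Here doc1 and doc2 look similar because they share two labels, yet the score says nothing about whether those labels appear in a context the user cares about.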
The issues with clustering boil down to two problems:
- Recognizing the user’s goal so that you can define the subset of features important to that user.
- Recognizing that the mere presence of a feature of interest in a document is insufficient. That feature must appear in a context that is relevant to the user’s current needs.
Do not rely on clustering results blindly; if the underlying methods are not smart enough, you can easily be misled.
The old adage “A picture is worth a thousand words” needs a slight revision: “The right picture is worth a thousand words.”