Text clustering usually relies on a text mining process that turns text into structured data for analysis by applying natural language processing (NLP) and analytical methods.
This post describes how to group the textual descriptions of European Union projects into topic clusters and how to visualize the results.
The process involves the following steps:
- Problem definition and identification of the text to collect
- Text pre-processing
- Feature extraction
- Model building and evaluation
Problem definition and identification of the text to collect
The dataset used in this project describes concrete projects funded by the European Union and was downloaded from the EU Open Data Portal (http://data.europa.eu/88u/dataset/eu-results-projects). The data include each project's description, which I use to group the projects into topic clusters.
Text pre-processing
I worked through the basic operations for cleaning and analyzing text data with the Python package spaCy (https://spacy.io/). The operations are:
- Generating lemmas and converting them to lower case;
- Removing punctuation;
- Removing stopwords: words that occur extremely commonly (e.g. articles, pronouns, auxiliary verbs);
- Removing special characters (numbers, emojis, etc.);
- Removing HTML/XML tags.
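The operations listed above could be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the `en_core_web_sm` model, the regular expression used to strip tags, and the `clean_text` helper are all assumptions made for the example.

```python
import re

import spacy

# Assumption: the small English model. If it is not installed, fall back
# to a bare tokenizer, which has stopword flags but no lemmatizer.
try:
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
except OSError:
    nlp = spacy.blank("en")

TAG_RE = re.compile(r"<[^>]+>")  # naive HTML/XML tag matcher (assumption)


def clean_text(text):
    """Return lower-cased lemmas with tags, stopwords and symbols removed."""
    doc = nlp(TAG_RE.sub(" ", text))
    tokens = []
    for tok in doc:
        # is_alpha drops punctuation, numbers, emojis and other symbols
        if not tok.is_alpha or tok.is_stop:
            continue
        # lemma_ is empty when the pipeline has no lemmatizer
        tokens.append((tok.lemma_ or tok.text).lower())
    return tokens


sample = "<p>The projects are funded in 2020 by the European Union!</p>"
print(clean_text(sample))
```

Filtering on `is_alpha` is a blunt choice: it also discards tokens such as "H2020" or hyphenated terms, so a real pipeline may want a more selective rule.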
Feature extraction
I organized the data into a structured TF-IDF (Term Frequency-Inverse Document Frequency) matrix.
In the TF-IDF matrix, the weight of a term in a document increases with the term's frequency in that document and decreases with the number of documents in which the term occurs.
The overall effect of this weighting is to counter a common problem in text analysis: the most frequently used words in one document are often the most frequently used words in all of the documents, so raw counts say little about what makes a document distinctive.
In the code below, I pre-process the description field of the dataframe and create a TF-IDF matrix with
Model building and evaluation
K-means clustering is an unsupervised machine learning algorithm that groups unlabeled data. It partitions n observations into k clusters so that each observation belongs to the cluster with the nearest mean (centroid).
The best number of clusters k is the one that leads to the greatest separation (distance) between clusters. There is no single right answer for the number of clusters in a given problem, and domain knowledge and data analysis experience can help.
In general, two methods can give us some intuition about k:
- Elbow method: this method gives an idea of what a good number of clusters k would be, based on the sum of squared distances (SSE) between data points and their assigned clusters' centroids;
- Silhouette analysis: This technique provides a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
The results of these statistical methods, combined with domain knowledge, led me to choose k equal to 5.
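Both methods can be computed in a single loop over candidate values of k. The sketch below assumes scikit-learn and uses synthetic data with three well-separated groups in place of the TF-IDF matrix, so the value of k it favors says nothing about the EU projects data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the TF-IDF matrix: three well-separated groups
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

sse, silhouette = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_  # sum of squared distances to the centroids
    silhouette[k] = silhouette_score(X, model.labels_)

# Elbow method: look for the k after which the SSE stops dropping sharply
for k in sse:
    print(k, round(sse[k], 1), round(silhouette[k], 3))

# Silhouette analysis: prefer the k with the highest average score
best_k = max(silhouette, key=silhouette.get)
print("k with the highest silhouette:", best_k)
```

On real text data the elbow is often less sharp and the two methods may disagree, which is where domain knowledge comes in.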
After choosing the best k, we can call the fit method of the KMeans model, passing it the TF-IDF matrix. This method fits the model to the data by locating and remembering the regions where the different clusters occur.
After fitting the model, I used joblib.dump to pickle the model.
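A minimal sketch of fitting and pickling, assuming scikit-learn and joblib. The sample descriptions, the `random_state`, and the file name `kmeans_eu_projects.joblib` are placeholders chosen for the example.

```python
import os
import tempfile

import joblib
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned descriptions standing in for the project data
docs = [
    "solar energy power grid",
    "wind energy storage",
    "urban rail transport",
    "electric bus transport",
    "cancer therapy clinical trial",
    "vaccine clinical research",
    "ocean biodiversity monitoring",
    "forest biodiversity conservation",
    "digital skills education",
    "teacher training education",
]
tfidf = TfidfVectorizer().fit_transform(docs)

# k = 5 as chosen in the text; n_init restarts reduce the risk of a poor optimum
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
kmeans.fit(tfidf)

# Pickle the fitted model (placeholder file name in the temp directory)
model_path = os.path.join(tempfile.gettempdir(), "kmeans_eu_projects.joblib")
joblib.dump(kmeans, model_path)

print(kmeans.labels_)  # cluster index assigned to each description
```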
I used Dash (an open source Python library built on top of Flask and Plotly.js) to create an analytical dashboard application that visualizes the text clustering results.
After reloading the model, we can call the predict method on the TF-IDF matrix created previously. This returns a cluster label for each sample, indicating which cluster the sample belongs to.
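The reload-and-predict step might look like the sketch below. To keep it runnable on its own, it first fits and pickles a tiny two-cluster model; the documents, the file name, and k = 2 are assumptions for the example only.

```python
import os
import tempfile

import joblib
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned descriptions: two energy and two transport projects
docs = [
    "solar energy power grid",
    "wind energy power grid",
    "urban rail transport network",
    "urban bus transport network",
]
tfidf = TfidfVectorizer().fit_transform(docs)

# Setup so the snippet runs alone: fit and pickle a small model
path = os.path.join(tempfile.gettempdir(), "kmeans_demo.joblib")
joblib.dump(KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf), path)

# Reload the pickled model and label every document
model = joblib.load(path)
labels = model.predict(tfidf)
for label, doc in zip(labels, docs):
    print(label, doc)
```

The matrix passed to `predict` has to come from the same fitted vectorizer as the training matrix, otherwise the columns no longer refer to the same terms.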
Analysing the resulting clusters, I assigned the following labels to each cluster:
I used a treemap, a hierarchical visualization, to show the top 30 terms for each thematic cluster. Box sizes indicate word frequency, and color indicates the average word frequency of the group.
To drill down further into this analysis, I used a sunburst diagram and a scatter plot diagram to visualize the hierarchy and distribution of project data grouped by cluster and the owner nation.
The project can be downloaded on GitHub: