My vignette is about text mining and analysis, using the tm and topicmodels packages in R together with Latent Dirichlet Allocation (LDA) to work out what documents are about without having to read them all!
The vignette shows you how to create a Document-Term Matrix, then uses LDA to work out what key themes are present in a body of documents (called a corpus) and assigns each document a probability for each topic.
This tool can help a user find a relevant document without having to search for it by name, or even know what it was written about!
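For readers who want to see the shape of that workflow before opening the vignette, here is a minimal sketch. It is not taken from the vignette itself: the toy texts and the choice of k = 2 topics are purely illustrative.

```r
# Minimal sketch: toy corpus -> Document-Term Matrix -> LDA.
# The example texts and k = 2 topics are illustrative only.
library(tm)
library(topicmodels)

docs <- c("cats and dogs are popular pets",
          "stocks and bonds are common investments",
          "dogs chase cats around the garden")

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))

dtm <- DocumentTermMatrix(corpus)
lda <- LDA(dtm, k = 2, control = list(seed = 123))

terms(lda, 3)   # top 3 terms per topic
```

Fixing the seed in `control` just makes the run reproducible; the topics themselves only become meaningful on a realistically sized corpus.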
Anyway, here is the link to my vignette:
I hope you find it useful.
Interesting blog, and it covers an area I have looked at before. I wonder if it is possible to do this with PDF files; I was faced with this issue at work and resorted to VBA to get through thousands of documents, though I was pulling out numerical data rather than text. Using R may well have been another option, and one long rainy day I might give it a try.
I found your vignette a great summary to start working with topic modeling. I am still getting my head around this, but your post cleared some confusion for me as to “why” and then “how” we do certain things to the data to achieve the Topic models.
I’m still not quite making the connection with the probability calculations and what they mean, but I know practice will solve this.
Good Topic! – pun intended.
It's basically just saying that a certain document is, say, 30% about Topic 1, 20% about Topic 2, and so on, with the percentages summing to 100%.
I need to do more work to visualise that, but I basically ran out of time!
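In code terms, those percentages are the per-document topic distribution, which topicmodels exposes via `posterior()`. A short sketch, assuming `lda` is a model already fitted with `topicmodels::LDA()`:

```r
# Sketch: per-document topic probabilities from a fitted LDA model.
# 'lda' is assumed to be an object returned by topicmodels::LDA().
library(topicmodels)

gamma <- posterior(lda)$topics  # documents-by-topics probability matrix
round(gamma, 2)                 # e.g. a row of 0.30 0.20 0.50 reads as
                                # 30% Topic 1, 20% Topic 2, 50% Topic 3
rowSums(gamma)                  # each row sums to 1
```

Each row of that matrix is exactly the "30% about Topic 1, 20% about Topic 2" breakdown described above.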