Text Mining Blog

Does your library hold unique knowledge?

At Biblio Insights we are exploring how libraries can use text data mining to extract new knowledge from the published materials in their collections.

Text mining, also referred to as text analytics, is the automated searching and retrieval of words and phrases from large bodies of text to reveal patterns and trends. In recent years automated techniques have accelerated the analysis of texts for humanities and scientific researchers alike. The text must, of course, be in digital form. Many cultural institutions and libraries have recently begun to scan publications and convert their text to digital form using ever more reliable OCR tools, which gives us a new opportunity to create knowledge from these collections.
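To make the idea concrete, here is a minimal sketch in Python of the simplest kind of text mining: counting how often words occur across a set of pages. The sample text, the page names and the helper term_frequencies are invented for illustration, and a real project would work on OCR output and would likely use a dedicated text analysis library.

```python
import re
from collections import Counter

def term_frequencies(text, min_length=4):
    """Count how often each word of at least min_length letters appears."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if len(w) >= min_length)

# Hypothetical OCR output from two scanned pages (invented sample text).
pages = {
    "page_1": "The flood reached Brisbane. The flood damaged the railway.",
    "page_2": "Railway workers repaired the line after the flood.",
}

# Add up the counts from every page to see which terms dominate the set.
totals = Counter()
for name, text in pages.items():
    totals.update(term_frequencies(text))

print(totals.most_common(2))  # [('flood', 3), ('railway', 2)]
```

Run over millions of pages and grouped by year of publication, counts like these are what begin to reveal patterns and trends.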

At the State Library of Queensland last year, as we were preparing the case for large-scale digitization of the printed legal deposit collection, we estimated that this medium-sized print collection would yield about 400 million pages of digitized text. A corpus of this size certainly qualifies as ‘big data’, and since the publications are unique to Queensland (published in or about Queensland) we saw an opportunity to create new knowledge about Queensland using text mining tools. These tools include opinion mining and sentiment analysis, which can track public and minority opinion during key periods of Queensland’s history (e.g. WWI), and the ability to locate specific names (places, businesses and families) embedded in monographs, journals and newspapers.
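Locating specific names in text can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list known_names, the sample sentence and the helper find_names are hypothetical, and matching against a fixed list is a simple stand-in for the more capable named entity recognition tools a real project would probably use.

```python
import re

def find_names(text, names):
    """Return (name, character offset) for each whole-word match of a known name."""
    hits = []
    for name in names:
        # re.escape keeps any punctuation in a name from being read as a pattern.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((name, match.start()))
    # Sort by position so the hits follow the order of the text.
    return sorted(hits, key=lambda hit: hit[1])

# Hypothetical list of place names and an invented sample sentence.
known_names = ["Toowoomba", "Ipswich"]
sample = "The train left Ipswich at noon and reached Toowoomba by evening."

hits = find_names(sample, known_names)
for name, offset in hits:
    print(name, offset)
```

Recording the offset of each match makes it possible to link a name back to the page and line where it appears.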

Whilst some libraries are using data analytics to understand their audiences and to customize their services so that they engage better with their communities, it is now time for libraries to consider how these data mining tools can yield new knowledge from collections of publications previously considered too large and unwieldy to analyze.

As we explore these new tools, we would be happy to hear about your experiences, whether in using such tools or in preparing to digitize content for this purpose.

Janette Wright