Blog mining in reduced dimensions

Despite exaggerated predictions about the death of the blog, their numbers seem to keep growing. Everyone and their dog seems to want to share their innermost thoughts, echo the news and comment on everything from the Higgs boson to the Gaga Biebon. Efficiently pulling useful information from millions of blogs, with all their various formats and multimedia content, is no simple task, however, and old-school data mining techniques do not appear to be as effective with these disparate networks as they are with standardised databases.

Writing in the International Journal of Data Mining, Modelling and Management, Flora Tsai of the School of Electrical and Electronic Engineering at Nanyang Technological University in Singapore hopes to remedy that situation. She is developing a data visualisation approach that reduces the number of “dimensions”, that is, the different aspects of the collected blogs, so that title, content, tags (labels and/or categories), authors (blogger or guest), URL (permalink), time-date stamp, outbound and inbound links, and optional geotag (location) are flattened into a more usefully searchable database. She explains the approach:

Dimensionality reduction is the search for a small set of features to describe a large set of observed dimensions. By performing dimensionality reduction, hidden structure can be uncovered that aids in the understanding as well as visualisation of the data.

The aim of reducing dimensions is not to throw away useful information but to project the data onto a system that can be examined more easily, which allows the data miner to find useful patterns and information. What dimensionality reduction can do is quickly remove noise words and spurious “tags”, so that large data sets, although probably not the whole blogosphere, could be analysed quickly.
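To make the idea of projection concrete, here is a minimal sketch in Python. It is not Tsai's framework, whose details are not given here; it swaps in plain principal component analysis, computed by power iteration, as a generic stand-in. The four toy "posts", the word-count features and all function names are assumptions made up for illustration.

```python
import math

# Hypothetical toy corpus: each string stands in for one blog post.
POSTS = [
    "higgs boson physics collider",
    "physics collider particle higgs",
    "pop music concert tour",
    "music tour concert album",
]


def term_matrix(posts):
    """Build a posts x vocabulary count matrix (one dimension per word)."""
    vocab = sorted({word for post in posts for word in post.split()})
    rows = [[post.split().count(word) for word in vocab] for post in posts]
    return vocab, rows


def center(rows):
    """Subtract each column's mean so the projection captures variance."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def top_component(rows, iters=200):
    """Power iteration for the leading principal direction of the rows."""
    dims = len(rows[0])
    v = [1.0 / (i + 1) for i in range(dims)]  # fixed, arbitrary start vector
    for _ in range(iters):
        # One step of v <- X^T (X v), followed by normalisation.
        scores = [sum(a * b for a, b in zip(row, v)) for row in rows]
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate input: all rows identical after centering
        v = [x / norm for x in w]
    return v


def project(rows, v):
    """Collapse every row to a single coordinate along direction v."""
    return [sum(a * b for a, b in zip(row, v)) for row in rows]


if __name__ == "__main__":
    vocab, counts = term_matrix(POSTS)
    centered = center(counts)
    direction = top_component(centered)
    for post, score in zip(POSTS, project(centered, direction)):
        print(f"{score:+.3f}  {post}")
```

Run as a script, this collapses the ten word-count dimensions of each toy post into a single coordinate. The two physics posts land on one side of zero and the two music posts on the other, which is the kind of hidden structure the quoted passage refers to. A real blog corpus would need more than one component and a proper linear algebra library.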

Tsai, F.S. (2012). Dimensionality reduction framework for blog mining and visualisation, International Journal of Data Mining, Modelling and Management, 4 (3) DOI: 10.1504/IJDMMM.2012.048108