Revealing Invisible Science

Revealing the Invisible Web with CCReSD

The notion of the Invisible Web created quite a buzz long before Google had even one “oo”, let alone half a dozen. The phrase alluded to the putative millions of additional web pages essentially hidden from view behind database scripts: fascinating product catalogues, riveting company backend data, and scientific databases.

Scientific databases, you say, invisible?

Of course! You probably think of the databases with which you are personally familiar as being directly accessible and that there is nothing hidden about their contents at all. Much of the search functionality of countless scientific databases will work perfectly well regardless of your IP address, irrespective of whether you have logged in, and from almost anywhere in the world. Some are closed off to non-subscribers or those outside a particular campus or organisation, of course, but many are not. So, by what stretch of the imagination might they be described as hidden, or worse, invisible?

Well, do you know precisely what is contained in the close to 1000 terabytes of information in the National Climatic Data Centre? What about your favourite literature database? What about PubMed or ChemSpider? Or any of the dozens and dozens of other databases hidden, by their very nature, from conventional search engines? Obviously, specific users will have a relatively detailed picture of the contents of a particular database, but what about researchers from other disciplines or, perish the thought, lay outsiders who may need to access information quickly without spending hours, days or weeks attempting to find the right database and then attempting to figure out what is in it?

Yih-Ling Hedley and Anne James of the Faculty of Engineering and Computing at Coventry University, and Muhammad Younas of the Department of Computing at Oxford Brookes University, Oxford, England, point out that invisible web databases dynamically generate results in response to users’ queries. And therein lies the rub. Search engines, which traditionally crawl, spider and index the web, generally see only the front-end search page when they visit a site that acts as a user interface for a database. This means that the keywords associated with the data within those databases are never indexed, because the result pages are generated dynamically in response to real users’ queries and are not seen by the search engine robots.

Nevertheless, Hedley and colleagues say, “The categorisation of such databases into a category scheme has been widely employed in information searches,” but with only limited success. Now, the team has developed and tested a Concept-based Categorisation over Refined Sampled Documents (CCReSD) approach that effectively handles information extraction, summarisation and categorisation of such databases. Unlike a conventional search engine, CCReSD behaves in some ways like a real live user and detects and extracts query-related information from sampled documents of databases.
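To make the idea of "behaving like a user" concrete, here is a minimal Python sketch of query-based sampling against a search form. It is not the team's implementation: the in-memory HIDDEN_DOCS collection, the search() function and the rule for choosing the next query term are all assumptions made purely for illustration.

```python
from collections import deque

# Hypothetical stand-in for a hidden database: its documents can only be
# reached through search(), never by following links.
HIDDEN_DOCS = {
    1: "ocean temperature records from coastal stations",
    2: "rainfall and temperature averages by region",
    3: "rainfall gauges and flood warnings",
    4: "protein structure entries and sequence data",
}


def search(term, limit=2):
    """Simulate the search form: return up to `limit` (id, text) matches."""
    hits = [
        (doc_id, text)
        for doc_id, text in sorted(HIDDEN_DOCS.items())
        if term in text.split()
    ]
    return hits[:limit]


def sample_database(seed_term, max_queries=10):
    """Probe the search form, reusing words from retrieved documents as queries."""
    sampled = {}                  # doc id -> text of every document seen so far
    pending = deque([seed_term])  # query terms still to try, oldest first
    tried = set()
    queries = 0
    while pending and queries < max_queries:
        term = pending.popleft()
        if term in tried:
            continue
        tried.add(term)
        queries += 1
        for doc_id, text in search(term):
            if doc_id not in sampled:
                sampled[doc_id] = text
                # Assumed heuristic: longer words make better follow-up queries.
                pending.extend(w for w in text.split() if len(w) > 3)
    return sampled


if __name__ == "__main__":
    for doc_id, text in sorted(sample_database("temperature").items()):
        print(doc_id, text)
```

Starting from a single seed term, the sampler only ever finds documents that share vocabulary with what it has already retrieved, which is why a sample gives a partial view of a database and why its quality matters.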

The result is that the system can generate a table of keyword terms and their frequencies to summarise database contents. The team explains that their system also generates descriptions of concepts from their coverage and specificity given in a category scheme.
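As a rough illustration of what such a table looks like, the following Python sketch counts keyword terms across a handful of sampled documents. The sample texts, the small STOPWORDS list and the summarise() function are hypothetical, and the counting is deliberately simple; it makes no attempt at the refinement or concept-based categorisation the team describes.

```python
import re
from collections import Counter

# Assumed stop-word list, far shorter than anything a real system would use.
STOPWORDS = {"the", "and", "of", "from", "by", "a", "in"}


def summarise(sampled_docs, top_n=5):
    """Return the top_n (term, frequency) pairs across the sampled documents."""
    counts = Counter()
    for text in sampled_docs:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)


if __name__ == "__main__":
    docs = [
        "Ocean temperature records",
        "Rainfall and temperature averages",
        "Rainfall gauges",
    ]
    for term, freq in summarise(docs, top_n=3):
        print(f"{term}\t{freq}")
```

A summary of this kind lets an outsider judge at a glance what a database is about without having to guess suitable queries first.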

Okay, that sounds useful. CCReSD is basically a database-savvy search engine spider that can create an index from otherwise hidden web resources by spoofing the behaviour of a genuine human user of that database. Aside from the risk of breaching database terms & conditions that forbid automated access, this could be a very useful tool for technical subjects that have many, many hidden databases.

The team tested their system on the Help Site database (computer manuals on a system with multiple templates), CHID (a healthcare database with a single template) and the general database-driven site Wired News (single template). They found that it could extract relevant information from sampled documents and generate terms and frequencies more accurately than previous approaches.

The team discusses CCReSD in detail in the Int J High Performance Computing Networking, 2007, 5, 24-33.