Identifying Digital Gems

Sciencebase readers will likely be aware that when I cite a research paper, I usually use the DOI system, the Digital Object Identifier. This acts like a redirect service: it takes a unique identifier, assigned to each research paper by its publisher, and passes it to a server that works out where the actual paper is on the web.

The DOI system has several handlers, and indeed, that’s one of its strengths: it is distributed. So, as long as you have the DOI, you can use any of the handlers (dx.doi.org, http://hdl.handle.net, http://hdl.nature.com/, etc.) to look up a paper of interest, e.g. http://dx.doi.org/10.1504/IJGENVI.2008.018637 will take you to a paper on water supplies on which I reported recently.
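
As a rough illustration of the "any handler will do" idea, here is a minimal Python sketch that builds one lookup URL per resolver for a given DOI. The resolver hosts are the ones named above; the function name `resolver_urls` and the simple validity check are assumptions made for this example, not part of the DOI system itself, and the sketch does not check whether each host is still in service.

```python
from typing import List
from urllib.parse import quote

# Resolver base URLs taken from the post; hypothetical list for illustration.
RESOLVERS = [
    "http://dx.doi.org/",
    "http://hdl.handle.net/",
]


def resolver_urls(doi: str) -> List[str]:
    """Return one lookup URL per resolver for the given DOI (illustrative helper)."""
    doi = doi.strip()
    # Very loose sanity check: DOIs start with "10." and have a prefix/suffix slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Keep the slash between prefix and suffix; percent-encode anything URL-unsafe.
    encoded = quote(doi, safe="/")
    return [base + encoded for base in RESOLVERS]


if __name__ == "__main__":
    for url in resolver_urls("10.1504/IJGENVI.2008.018637"):
        print(url)
```

The same DOI string is simply appended to each handler's address, which is why holding on to the DOI matters more than holding on to any one URL.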

The DOI is a kind of hard-wired redirect for the actual URL of the object itself, which at the moment will be a research paper. It could, however, be any other digital object: an astronomical photograph, a chemical structure, or a genome sequence, for instance. In fact, thinking about it, a DOI could be used as a shorthand, a barcode, if you like, for whole genomes, protein libraries, databases, molecular depositions.

I’m not entirely sure why we also need the Library of Congress permalinks, the National Institutes of Health simplified web links, and the likes of PURL, not to mention all those URL-shortening systems like tinyURL and snipurl. A unified approach that worked at the point of origin, with the creator of the digital object, would seem so much more straightforward; I have suggested this previously and coined the term PaperID for it.

One critical aspect of the DOI is that it ties to hard, unchanging, non-dynamic links (URLs) for any given paper, or other object. Over on the CrossTech blog, Tony Hammond raises an interesting point regarding one important difference between hard and soft links and the rank that material at the end of such a link will receive in the search engines. His post discusses DOI and related systems, such as PURL (the Persistent URL system), which also uses an intermediate resolution system to find a specific object at the end of a URL. There are other systems emerging such as OpenURL and LCCN permalinks, which seek to do something similar.

However, while Google still predominates online search, hard links will be the only way for a specific digital object to be given any weight in its results page. Dynamic or soft links are discounted, or not counted at all, and so never rank in the way that material at the end of a hard link will.
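
At the HTTP level, the hard/soft distinction is usually framed as a permanent (301) versus a temporary (302) redirect, which is how the comments below discuss it. The following is a minimal Python sketch, assuming you already have the raw response headers as text; the function names `parse_status` and `redirect_kind` are made up for this example. It only labels the redirect type and says nothing about how any search engine actually weighs it.

```python
from typing import Optional, Tuple


def parse_status(raw_headers: str) -> Tuple[int, Optional[str]]:
    """Return (status_code, location) from a raw HTTP response header block."""
    lines = [ln.strip() for ln in raw_headers.strip().splitlines() if ln.strip()]
    # The status line looks like: HTTP/1.1 302 Moved Temporarily
    status_code = int(lines[0].split()[1])
    location = None
    for ln in lines[1:]:
        name, _, value = ln.partition(":")
        if name.strip().lower() == "location":
            location = value.strip()
    return status_code, location


def redirect_kind(status_code: int) -> str:
    """Label a status code as a permanent or temporary redirect, or neither."""
    if status_code in (301, 308):
        return "permanent"
    if status_code in (302, 303, 307):
        return "temporary"
    return "other"


if __name__ == "__main__":
    # Header block as quoted in the comments on this post.
    sample = """
    HTTP/1.1 302 Moved Temporarily
    Server: Apache-Coyote/1.1
    Location: http://pubs.acs.org/cgi-bin/doilookup/?10.1021/cr00032a009
    """
    code, target = parse_status(sample)
    print(code, redirect_kind(code), target)
```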

Perhaps this doesn’t matter, as those scouring the literature will have their own databases to trawl that require their own ranking algorithms based on keywords chosen. But, I worry about serendipity. What of the student taking a random walk on the web for recreation or perhaps in the hope of finding an inspirational gem? If that gem is, to mix a metaphor, a moving target behind a soft link, then it is unlikely to rank in the SERPs and may never be seen.

Perhaps I’m being naive; maybe students never surf the web in this way, looking for research papers of interest. However, with multidisciplinarity increasingly necessary in many cross-disciplinary fields, it seems unlikely that gems are going to be unearthed through conventional literature searching of a parochial database that covers a limited range of journals and other resources.


6 thoughts on “Identifying Digital Gems”

  1. Rich, I think you may be right about a lack of hard evidence. I’ve never used 302s in 12 years of web work, and I do pretty well in terms of keyword ranking, although there was a time when my site was a PR8 and I did a whole lot better.

    This mentions 302s but really just says you shouldn’t use them http://www.google.com/support/webmasters/bin/answer.py?answer=40132&query=302&topic=&type=

    “but you shouldn’t use it to tell the Googlebot that a page or site has moved because Googlebot will continue to crawl and index the original location.”

    To my mind that sounds like it generates a conflict and a possible duplicate content issue with Google and, as you know, Google hates dupe pages and filters them out in the SERPs. So maybe that’s as good a reason as any to harden up DOI and similar redirects.

  2. I think I may have seen Matt Cutts’ item on 302-301, but, like you say, he doesn’t give anything away about whether Google ranks pages at the end of a 301, except that if they’re double-checking on a time cycle then it wouldn’t make sense to do so. This is perhaps especially so because of 302 hijack issues. Although, that said, I may have been hoist by my own petard there, because I think 302 hijacks do sometimes work for the hijacker and against the hijacked…

  3. David, this is an important topic that might benefit from more systematic experimentation. I’d be reluctant to conclude too much from one paper.

    I found this article from Matt Cutts:

    http://www.mattcutts.com/blog/seo-advice-discussing-302-redirects/

    I don’t think this addresses the issue of PageRank, but does give some food for thought. A casual search revealed a great deal of disagreement on whether a cross-domain 302 redirect link contributes to PageRank. Of course, only one opinion matters – Google’s – and they don’t seem to be saying much specifically.

  4. Rich, you’re right, Google’s algo is very sophisticated, but one thing most SEMs agree on, and I’m pretty sure I’ve heard Google’s spammeister Matt Cutts explain it too, is that soft redirects, i.e. non-permanent ones (302 redirects), are not carried. If you run a server headers test on a DOI address you will get a 302. This is the response for the Sharpless DOI you cite:

    Server Response : dx.doi.org/10.1021/cr00032a009
    HTTP/1.1 302 Moved Temporarily
    Server: Apache-Coyote/1.1
    Location: http://pubs.acs.org/cgi-bin/doilookup/?10.1021/cr00032a009

    That “Moved Temporarily” bit means Google doesn’t bother to index the “new” location properly. It may check again, but if it sees the 302 after a particular period it will most likely give up.

    Google published guidelines on redirecting URLs. Google Information for Webmasters – Question #2: “Once your new site is live, you may wish to place a permanent redirect (using a ‘301’ code in HTTP headers) on your old site to inform visitors and search engines that your site has moved.”

    I checked the ranking of the Chem Rev paper you cite, using the keyphrase “Sharpless Asymmetric Dihydroxylation”, without quotes, and couldn’t find the paper on the first ten SERPs. If people are linking to this DOI with that phrase, that kind of supports what I, and hundreds of SEMs, have said about non-permanent redirects.

    It could be that Google recognises DOIs, but I’ve not read that or heard that anyway. While its algo is definitely sophisticated there also has to be a balance at their end as to just how sophisticated they want to make it before its complexities begin to slow it down too much. I strongly suspect that the Google algo is always slightly less sophisticated than those who hope to second guess it believe and that this deficit is compensated by obfuscation on their part and the willing and wild theorising of those second guessers.

  5. >However, while Google still predominates online search, hard links will be the only way for a specific digital object to be given any weight in its results page. Dynamic or soft links are discounted, or not counted at all, and so never rank in the way that material at the end of a hard link will.

    David, it sounds like you’re saying that a particular paper cited around the web with a DOI-based link like:

    Sharpless Asymmetric Dihydroxylation

    will not contribute to the paper’s page rank.

    Google and other search engine services are notoriously secretive about how they rank pages; they’re at war with spammers after all. I’m curious – where is your information coming from? The author of the article you cite seems to offer no hard evidence either.

    Could it be that Google is smart enough to know a DOI when it encounters one, and handle it appropriately?
