Sunday, November 15, 2009

Reading Notes November 17

Shreeves, S. L., Habing, T. O., Hagedorn, K., & Young, J. A. (2005). Current developments and future trends for the OAI protocol for metadata harvesting. Library Trends, 53(4), 576-589.

This text was not easy to understand, but here are some excerpts:

"The mission of the Open Archives Initiative (...) is to "develop and promote interoperability standards that aim to facilitate the efficient dissemination of content" (Open Archives Initiative, n.d. a). The Protocol for Metadata Harvesting, a tool developed through the OAI, facilitates interoperability between disparate and diverse collections of metadata through a relatively simple protocol based on common standards (XML, HTTP, and Dublin Core)."

"The OAI protocol requires that data providers expose metadata in at least unqualified Dublin Core; however, the use of other metadata schmas is possible and encouraged. The protocol can provide access to parts of the "invisible Web" that are not easily accessible to search engines
(such as resources within databases) (Sherman & Price, 2003) and can provide ways for communities of interest to aggregate resources from geographically diffuse collections."

-- The harvested metadata is searchable and browsable without any manual cataloging of the individual OAI repositories.
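To make the harvesting model concrete, here is a minimal Python sketch (my own, not from the article) of an OAI-PMH request. The verb, metadataPrefix, and XML namespaces are defined by the protocol; the repository URL at example.org is hypothetical.

    # A minimal OAI-PMH harvesting request against a hypothetical repository.
    import urllib.request
    import urllib.parse
    import xml.etree.ElementTree as ET

    BASE_URL = "http://example.org/oai"  # hypothetical data provider

    # Every OAI-PMH request is a plain HTTP GET with a "verb" parameter;
    # oai_dc (unqualified Dublin Core) is the format every provider must support.
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    url = BASE_URL + "?" + urllib.parse.urlencode(params)

    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)

    # Pull the Dublin Core titles out of the harvested XML.
    ns = {
        "oai": "http://www.openarchives.org/OAI/2.0/",
        "dc": "http://purl.org/dc/elements/1.1/",
    }
    for record in tree.iterfind(".//oai:record", ns):
        title = record.find(".//dc:title", ns)
        if title is not None:
            print(title.text)

The same request could be issued in a browser, which is part of the protocol's appeal: it rides on nothing more exotic than HTTP, XML, and Dublin Core.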

-- I could not pull up the text on how search engines work today, and will continue the reading tomorrow, once I get access to another computer.

Bergman, M. K. (2001). White paper: The deep Web: Surfacing hidden value. Journal of Electronic Publishing, 7(1). DOI: http://dx.doi.org/10.3998/3336451.0007.104

-- This article is very enlightening -- it covers BrightPlanet's technology for searching the deep Web

-- Most of the Web's information is buried far down on dynamically generated sites, and standard search engines never find it; the deep Web is about 500 times the size of the surface Web

-- Traditional search engines create their indices by spidering or crawling surface Web pages. To be discovered, a page must be static and linked to other pages. Traditional search engines cannot "see" or retrieve content in the deep Web

-- The deep Web is qualitatively different from the surface Web. Deep Web sources store their content in searchable databases that only produce results dynamically in response to a direct request.
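A small illustration of that point (mine, with a hypothetical endpoint and parameter names): deep Web content only materializes when someone issues a direct query to a database-backed search form, which is exactly the kind of request a link-following crawler never makes.

    # Deep Web content appears only in response to a direct query like this one;
    # the endpoint and parameter names are hypothetical, for illustration only.
    import urllib.request
    import urllib.parse

    query = urllib.parse.urlencode({"q": "metadata harvesting", "page": 1})
    url = "http://example.org/search?" + query   # a database-backed search form

    # A person (or a specialized deep-Web tool) issues the request and gets a
    # dynamically generated result page; a link-following crawler never does,
    # because no static page links to this exact URL.
    with urllib.request.urlopen(url) as response:
        print(response.read()[:200])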

-- The deep Web is a highly coveted commodity

-- A study by the NEC Research Institute found that search engines such as Google and Northern Light crawl only about 16% of the Web's content

-- Search engines obtain their listings in two ways: Authors may submit their own Web pages, or the search engines "crawl" or "spider" documents by following one hypertext link to another. The latter returns the bulk of the listings. Crawlers work by recording every hypertext link in every page they crawl and index.
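Here is a toy Python sketch of that crawl-and-record loop (my own illustration, seeded with a hypothetical URL); real crawlers add robots.txt handling, politeness delays, and enormous scale, but the core mechanic is the same.

    # A toy link-following crawler: fetch a page, record its links, repeat.
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        """Records every hypertext link found in a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=10):
        frontier = [seed]          # links waiting to be fetched
        seen = set()               # pages already indexed
        while frontier and len(seen) < max_pages:
            url = frontier.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
            except Exception:
                continue           # unreachable or non-HTML page: skip it
            parser = LinkCollector()
            parser.feed(html)
            # Every recorded link becomes a candidate for the next fetch; a page
            # that no crawled page links to is never added, so it stays unseen.
            frontier.extend(urljoin(url, link) for link in parser.links)
        return seen

    # print(crawl("http://example.org/"))   # hypothetical seed page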

-- The crawls used to be indiscriminate, but "the most recent generation of search engines (notably Google) have replaced the random link-following approach with directed crawling and indexing based on the "popularity" of pages. In this approach, documents more frequently cross-referenced than other documents are given priority both for crawling and in the presentation of results. This approach provides superior results when simple queries are issued, but exacerbates the tendency to overlook documents with few links."

-- The problem here: without a link from another Web document, a page will never be discovered.
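The directed-crawling idea in the quote can be sketched as a priority queue keyed on how often a URL has been cross-referenced so far -- a crude stand-in for measures like PageRank, and my own illustration rather than any engine's actual algorithm. The fetch_links function is assumed to be supplied by the caller.

    # "Directed" crawling: the frontier is ordered by how often each URL has
    # been cross-referenced so far. Names and structure are illustrative only.
    import heapq
    from collections import defaultdict

    def directed_crawl(seed, fetch_links, max_pages=100):
        """fetch_links(url) -> list of outgoing links; supplied by the caller."""
        inlinks = defaultdict(int)       # how many times each URL was referenced
        heap = [(0, seed)]               # entries are (-inlink count, url)
        seen = set()
        while heap and len(seen) < max_pages:
            _, url = heapq.heappop(heap)
            if url in seen:
                continue
            seen.add(url)
            for link in fetch_links(url):
                inlinks[link] += 1
                # More frequently cross-referenced pages float to the top of the
                # frontier; pages nobody links to never get queued at all.
                heapq.heappush(heap, (-inlinks[link], link))
        return seen

This makes the quoted trade-off visible in miniature: popular pages are fetched earlier, while sparsely linked pages sink to the bottom of the queue or never enter it.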

-- They don't use the term "invisible Web" -- the content is not invisible, but rather unindexable by current search engines

-- The article continues to describe the study in more detail

-- It is impossible to completely index the deep Web's content, but new technologies need to be developed to search the complete Web

Searching must evolve to encompass the complete Web.
