Council Session: Web Harvesting (notes)
Matt Landgraf & Kathryn Brazee
10.23.2006
- discover and assess if the information is within GPO's scope
- discovery: web crawler technologies
- fugitive documents: major & growing problem
- has been (@ GPO)
- manual harvesting
- semi-manual harvesting
- find EPA publications & associated metadata
- any language/format/location
- excludes internal/classified documents
- includes publications from contractors and grantees
- not those posted on unofficial websites
both vendors used their own crawlers & filtering algorithms to determine scope
contractor #1: IIA (Information International Associates)
- close relationship with EPA
- asked about scope & examples of in-scope documents
- substantial amount of analysis
- 2/3 of first crawl rules are portable to other agencies
- __positive rules__ were more effective than __negative rules__
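A minimal sketch of how positive and negative scope rules might be combined in a filter. The notes do not record what either vendor's rules looked like, so the patterns, the example URLs, and the `in_scope` function below are all hypothetical.

```python
import re

# Hypothetical rules, invented for illustration only.
# Positive rules describe what an in-scope publication looks like.
POSITIVE_RULES = [
    re.compile(r"\.pdf$", re.IGNORECASE),
    re.compile(r"/publications/", re.IGNORECASE),
]
# Negative rules describe what to exclude.
NEGATIVE_RULES = [
    re.compile(r"/draft/", re.IGNORECASE),
    re.compile(r"/intranet/", re.IGNORECASE),
]


def in_scope(url: str) -> bool:
    """Return True if no negative rule matches and at least one positive rule does."""
    if any(rule.search(url) for rule in NEGATIVE_RULES):
        return False
    return any(rule.search(url) for rule in POSITIVE_RULES)


if __name__ == "__main__":
    for url in (
        "https://example.gov/publications/report.pdf",
        "https://example.gov/draft/report.pdf",
        "https://example.gov/about.html",
    ):
        print(url, in_scope(url))
```

In this sketch a URL that matches no rule at all is treated as out of scope, which is one possible default and not something stated in the session.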
contractor #2: Blue Angel
- divided into categories
- those categories that are not in-scope
- have a list of documents that are out of scope
- database of in-scope & out-of-scope publications
- for list: help further rules testing
- how to distinguish "draft," unofficial EPA documents, etc.
- presentation given by agency members?
- seems counter-intuitive to have Blue Angel depicted in red on the graph...
slide #17
- what fraction of the crawled documents are already cataloged by GPO?
- 3rd crawl had an automated tool to determine this
- still evaluating those results for 1st & 2nd crawl
need to analyze and see how much overlap there was between the two pilots (IIA/Blue Angel)
- 1:1
- how deep did each vendor get into the EPA site?
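One way the overlap between the two pilots could be measured, assuming each vendor's results can be reduced to a list of document identifiers (for example, normalized URLs). The function name and the sample identifiers are assumptions for illustration, not part of the pilot.

```python
def overlap_report(vendor_a, vendor_b):
    """Compare two collections of document identifiers and summarize their overlap."""
    a, b = set(vendor_a), set(vendor_b)
    both = a & b
    union = a | b
    return {
        "both": len(both),        # found by both vendors
        "only_a": len(a - b),     # found only by the first vendor
        "only_b": len(b - a),     # found only by the second vendor
        # Jaccard similarity: shared documents over all distinct documents
        "jaccard": len(both) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical identifiers standing in for harvested documents.
    print(overlap_report(["u1", "u2", "u3"], ["u2", "u3", "u4"]))
```

The same set comparison could be used to estimate what fraction of crawled documents are already cataloged, given a list of cataloged identifiers.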
fugitive documents (to GPO) estimated at 50-75%
there will be some human follow-up to determine accuracy
assumption #1
- depends on a success rate that minimizes analysis by GPO staff
- don't have all the results (discussion question #2)
- how much better is it than what we've got now?
what is an acceptable level?
- Walt - out-of-scope numbers are unacceptable, but that seems due to the approach chosen
- "only" vs. "mostly" (only these documents--may be missing some, or mostly these documents--and some others)
- Ann - says "mostly"
- like selecting General Publications
- err on the side of inclusiveness
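The "only" vs. "mostly" trade-off can be read as the usual precision vs. recall trade-off; that mapping is an interpretation, not something stated in the session. A minimal sketch, assuming a hand-checked list of truly in-scope documents is available for comparison (the identifiers are made up):

```python
def precision_recall(harvested, in_scope):
    """Return (precision, recall) for a harvest against a known in-scope set.

    precision: share of harvested documents that are in scope ("only these")
    recall:    share of in-scope documents that were harvested ("mostly these, and some others"
               favors this one)
    """
    harvested, in_scope = set(harvested), set(in_scope)
    hits = harvested & in_scope
    precision = len(hits) / len(harvested) if harvested else 0.0
    recall = len(hits) / len(in_scope) if in_scope else 0.0
    return precision, recall


if __name__ == "__main__":
    truth = {"a", "b", "c", "d"}
    # "only": nothing out of scope, but half the in-scope documents are missed
    print(precision_recall({"a", "b"}, truth))
    # "mostly": everything in scope is found, plus one out-of-scope document
    print(precision_recall({"a", "b", "c", "d", "x"}, truth))
```

Erring on the side of inclusiveness corresponds to accepting lower precision in exchange for higher recall.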
amount of time it would take to track down and find out-of-scope documents
Geoffrey - aren't there policy consequences to getting out-of-scope documents? At what point does it become a problem?
some out-of-scope documents weren't government documents or were state documents
re-harvest and find what has changed?
- will probably be incorporated into another pilot
EPA site = about 23 separate sites
- so didn't spend 3 weeks on just one site
what other avenues?
concern about not being intrusive / taking up bandwidth
- received parameters from EPA webmaster
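The notes do not say what parameters the EPA webmaster supplied or how they were conveyed. One common way for a site to publish crawl limits is a robots.txt file; the sketch below reads a hypothetical one with Python's standard `urllib.robotparser`. The robots.txt content and the example.gov URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

if __name__ == "__main__":
    # Seconds a polite crawler should wait between requests, if the site specifies it.
    print(rules.crawl_delay("*"))
    print(rules.can_fetch("*", "https://example.gov/private/memo.html"))
    print(rules.can_fetch("*", "https://example.gov/pubs/report.pdf"))
```

A crawler would sleep for the returned delay between requests and skip any URL for which `can_fetch` is false.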
any discussion with NTIS?
definition of what the scope is -- can it be shared with the public?
future test websites?