FDLP2006WebHarvesting

Page history last edited by PBworks 17 years, 5 months ago

FDLP 2006

Council Session: Web Harvesting (notes)

Matt Landgraf & Kathryn Brazee

10.23.2006

 

These are my notes. For the actual presentation, see:

* presentation summary (adapted from powerpoint file)

* download PowerPoint file

* download handout (pdf)

 

  • discover and assess if the information is within GPO's scope
    • discovery: web crawler technologies
    • fugitive documents: major & growing problem

 

  • has been (@ GPO)
    • manual harvesting
    • semi-manual harvesting

 

  • find EPA publications & associated metadata
    • any language/format/location
    • excludes internal/classified documents
    • includes publications produced under contract or grant
    • not those posted on unofficial websites

 

both vendors used their own crawlers & filtering algorithms to determine scope

 

contractor #1: IIA (Information International Associates)

  • close relationship with EPA
  • asked about scope & examples of in-scope documents
  • substantial amount of analysis
  • 2/3 of first crawl rules are portable to other agencies
  • __positive rules__ were more effective than __negative rules__
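
To make the positive/negative rule idea concrete, here is a minimal sketch of a rule-based scope filter. Everything in it is an assumption for illustration: the `Rule` class, the `classify` function, and the sample patterns are hypothetical and are not taken from either vendor's rule set. A positive rule counts as evidence that a document is in scope; a negative rule counts as evidence against.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    field: str      # which attribute to test: "url" or "title"
    pattern: str    # regular expression, matched case-insensitively
    positive: bool  # True = evidence for in-scope, False = evidence against


# Hypothetical rules for illustration only; not the vendors' rule sets.
RULES = [
    Rule("url", r"epa\.gov/.*\.pdf$", True),
    Rule("title", r"\b(report|guidance|handbook)\b", True),
    Rule("title", r"\bdraft\b", False),
    Rule("url", r"/intranet/", False),
]


def classify(url: str, title: str, rules=RULES) -> str:
    """Return 'in-scope', 'out-of-scope', or 'needs-review'."""
    fields = {"url": url, "title": title}
    matched = [r for r in rules
               if re.search(r.pattern, fields[r.field], re.IGNORECASE)]
    if any(not r.positive for r in matched):
        return "out-of-scope"   # any negative rule vetoes the document
    if any(r.positive for r in matched):
        return "in-scope"       # at least one positive rule, no vetoes
    return "needs-review"       # no rule fired; leave for human follow-up
```

In this sketch a negative rule always wins. That is only one possible policy; a filter that leans on positive rules, as the IIA finding suggests, might instead score matches and send borderline cases to review.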

 

contractor #2: Blue Angel

  • divided into categories
    • those categories that are not in-scope
  • have a list of documents that are out of scope
    • more by Blue Angel

 

  • database of in-scope & out-of-scope publications
    • for list: help further rules testing
    • how to distinguish "draft," unofficial EPA documents, etc.
    • presentation given by agency members?
    • seems counter-intuitive to have Blue Angel depicted in red on the graph...

 

slide #17

  • contract or in-house?

 

  • what fraction of the crawled documents are already cataloged by GPO?
    • 3rd crawl had an automated tool to determine this
    • still evaluating those results for 1st & 2nd crawl
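
The notes don't say how the automated tool for the 3rd crawl worked. As a rough sketch of the kind of check involved, assuming documents are matched on a normalized title (a hypothetical matching key; the real tool may have used something else), the already-cataloged fraction could be computed like this:

```python
def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't block a match."""
    return " ".join(title.lower().split())


def cataloged_fraction(crawled_titles, catalog_titles) -> float:
    """Fraction of distinct crawled titles that also appear in the catalog."""
    crawled = {normalize(t) for t in crawled_titles}
    catalog = {normalize(t) for t in catalog_titles}
    if not crawled:
        return 0.0
    return len(crawled & catalog) / len(crawled)
```

One minus this fraction gives the share of crawled documents not found in the catalog, which is the kind of figure a fugitive-document estimate would be based on.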

 

need to analyze and see how much overlap there was between the two pilots (IIA/Blue Angel)

  • 1:1
  • how deep did each vendor get into the EPA site?
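
A sketch of how the overlap and depth questions might be answered once both vendors' URL lists are in hand. The function names are hypothetical, and "depth" is assumed here to mean the number of path segments in a URL, which is only one way to measure how far a crawler got into a site.

```python
from urllib.parse import urlparse


def overlap(urls_a, urls_b) -> dict:
    """Count URLs found by both pilots, by only one, and the Jaccard similarity."""
    a, b = set(urls_a), set(urls_b)
    union = a | b
    return {
        "both": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "jaccard": len(a & b) / len(union) if union else 0.0,
    }


def max_depth(urls) -> int:
    """Largest number of non-empty path segments among the given URLs."""
    return max(
        (len([seg for seg in urlparse(u).path.split("/") if seg]) for u in urls),
        default=0,
    )
```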

 

fugitive documents (to GPO) estimated at 50-75%

there will be some human follow-up to determine accuracy

 

assumption #1

  • depends on a success rate that minimizes analysis by GPO staff
  • don't have all the results yet (see discussion question #2)
  • how much better is it than what we've got now?

 

what is an acceptable level?

  • walt - out-of-scope numbers are unacceptable, but that seems to be due to the approach chosen
  • "only" vs. "mostly" (only these documents--may be missing some, or mostly these documents--and some others)
  • ann - says "mostly"
    • like selecting General Publications
    • err on the side of inclusiveness

 

amount of time it would take to track down and find out-of-scope documents

  • may not be worth it

 

geoffrey - aren't there policy consequences to getting out-of-scope documents? at what point does it become a problem?

 

some out-of-scope documents weren't government documents or were state documents

 

re-harvest and find what has changed?

  • will probably be incorporated into another pilot
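
No method for re-harvesting was described in the session. One common approach, sketched here under the assumption that each harvest is stored as a mapping from URL to a content hash (the `fingerprint` and `diff_harvests` names are hypothetical), is to compare fingerprints between two crawls:

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def diff_harvests(old: dict, new: dict) -> dict:
    """Compare two harvests, each a dict mapping URL -> fingerprint."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            url for url in old.keys() & new.keys() if old[url] != new[url]
        ),
    }
```

A byte-level hash flags any change at all, including cosmetic ones, so a real pilot would probably need a looser comparison for pages with dynamic content.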

 

EPA site = about 23 separate sites

  • so didn't spend 3 weeks on just one site

 

what other avenues?

 

concern about not being intrusive / taking up bandwidth

  • received parameters from EPA webmaster
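
The notes don't record which parameters the EPA webmaster provided. As an illustration of how a crawler can respect a site's stated limits, this sketch uses Python's standard `urllib.robotparser` on a made-up robots.txt; the `crawl_plan` helper, the `pilot-bot` agent name, and the robots.txt contents are all assumptions.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
ROBOTS_LINES = [
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /private/",
]


def crawl_plan(urls, robots_lines, agent="pilot-bot"):
    """Return (allowed_urls, delay_seconds) according to the robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    allowed = [u for u in urls if parser.can_fetch(agent, u)]
    delay = parser.crawl_delay(agent) or 0  # None when no Crawl-delay is set
    return allowed, delay
```

A crawler would then wait `delay` seconds between requests to the allowed URLs; the sketch stops short of making any network calls.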

 

any discussion with NTIS?

  • yes

 

definition of what the scope is -- can it be shared with the public?

  • too nuanced a definition

 

future test websites?

