Code4Lib 2010: Notes from Day 1, afternoon

Taking Control of Metadata with eXtensible Catalog

Open source, user-centered, “not big”

Customizable, faceted, FRBRized search interface

Automated harvesting and processing of large batches of metadata

Connectivity Tools: between XC and the ILS (& NCIP); allow you to get MARCXML out of the ILS

XC Metadata Services Toolkit: takes DC from the repository and MARCXML and sends it to Drupal (toolkit)

XC = eXtensible Catalog

Nice Drupal interface (sample).  Modify PHP code to change display

Create custom web apps to browse catalog content (fill out web form), or preset search limits & customized facets

XC Metadata Services Toolkit: cleans, aggregates, makes OAI compatible.  Tracks “predecessor-successor” records (changes)

XC OAI Toolkit transforms MARC to MARCXML; drivers available for Voyager, Aleph, NTT, III, III/Oracle (note from Twitter response: waiting to see if there is interest in SIRSI)
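For readers who have not worked with OAI harvesting, here is a minimal sketch of the OAI-PMH side of this, using only the Python standard library. It is not XC's code: the base URL and the `marc21` metadata prefix are assumptions for illustration, and the function names are hypothetical. It builds a `ListRecords` request URL and pulls record identifiers and the resumption token out of a response.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords URL (hypothetical helper).

    A resumptionToken is sent on its own, without metadataPrefix,
    when continuing a partial harvest.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (identifiers, resumption_token_or_None) from a response body."""
    root = ET.fromstring(xml_text)
    ids = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token
```

A harvester would fetch the first URL, parse the response, and keep requesting with the returned token until none comes back.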

www.eXtensibleCatalog.org: still needs funding/community support

Matching Dirty Data, yet another wheel

Matching bib data with no common identifiers

Goal: ingest metadata and PDFs for ETDs received in DSpace

MARC data in UMI: filename & abstract; ILS MARC data: OCLC #, author, title, date type, dept., subject

Create Python dictionaries, do exact & fuzzy matches.  Find the intersection of the keys and filter/refine from there
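A rough sketch of the dictionary approach, not the speaker's actual code. It assumes both record sets carry a `title` field (hypothetical; the talk did not spell out which fields were used as keys) and uses the standard library's `difflib` for the fuzzy pass.

```python
import difflib


def normalize(title):
    """Lowercase and drop punctuation so trivially different titles share a key."""
    kept = "".join(ch for ch in title.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def match_records(records_a, records_b, cutoff=0.9):
    """Match two lists of dicts on a normalized 'title' key (assumed field)."""
    a_by_key = {normalize(r["title"]): r for r in records_a}
    b_by_key = {normalize(r["title"]): r for r in records_b}

    # Exact matches: the intersection of the two key sets.
    exact_keys = a_by_key.keys() & b_by_key.keys()
    matches = {k: (a_by_key[k], b_by_key[k]) for k in exact_keys}

    # Fuzzy pass over whatever is left on both sides.
    remaining_b = [k for k in b_by_key if k not in exact_keys]
    for key in a_by_key.keys() - exact_keys:
        close = difflib.get_close_matches(key, remaining_b, n=1, cutoff=cutoff)
        if close:
            matches[key] = (a_by_key[key], b_by_key[close[0]])
    return matches
```

The exact pass is cheap, so it makes sense to run it first and only spend fuzzy comparisons on the leftovers.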

Reduce the search space (get rid of common words (not just stop words))
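One way to read "common words, not just stop words" is to drop any word that shows up in a large share of the records being matched. The sketch below does that; the 50% threshold and the function name are assumptions, not something stated in the talk.

```python
from collections import Counter


def distinctive_tokens(titles, max_fraction=0.5):
    """Return one token set per title, minus words common across the corpus."""
    token_sets = [set(t.lower().split()) for t in titles]
    # Document frequency: in how many titles does each word appear?
    doc_freq = Counter(tok for toks in token_sets for tok in toks)
    limit = max_fraction * len(titles)
    return [{tok for tok in toks if doc_freq[tok] <= limit} for toks in token_sets]
```

With a batch of theses, words like "study" or "analysis" would fall out this way even though they are not in any standard stop list.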

Jaro-Winkler algorithm: 2 characters count as a match if they are the same and within a reasonable distance of one another (a window based on string length); best for short strings

String comparison tutorial, SecondString (Java text analysis library, on SourceForge), MarcXimiL
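For reference, a plain-Python sketch of Jaro-Winkler as it is usually defined (matching window of half the longer string minus one, prefix bonus capped at 4 characters with weight 0.1). These are the textbook defaults, not necessarily the settings used in the talk.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and within this many positions.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)
```

For example, `round(jaro_winkler("MARTHA", "MARHTA"), 3)` gives 0.961. The window shrinks to nothing on very short strings and the prefix bonus matters less on long ones, which is part of why it suits short strings like names.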

http://snurl.com/uggtn

HIVE:  Helping Interdisciplinary Vocabulary Engineering

“Researchy” (as in not too well developed)

Problem: terms for self publishing instances.  Solution: combine terms from existing vocabularies (LCSH, MeSH, NBII Thesaurus) & compare to labeling

SKOS: somewhere between ontologies and cataloging

Abstract is run through HIVE, outputs extracted concepts cloud, color coded to represent catalog/ontology source.
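To make the idea concrete, here is a toy sketch of tagging concepts in an abstract with the vocabulary each came from. It uses naive phrase matching, which is a stand-in and not how HIVE itself extracts concepts; the vocabularies and terms in the usage note are made up for illustration.

```python
import re


def _words(text):
    """Lowercase a string and reduce it to space-separated alphanumeric words."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def extract_concepts(abstract, vocabularies):
    """Return (term, source_vocabulary) pairs for terms found in the abstract.

    vocabularies maps a vocabulary name to a list of its terms.
    """
    padded = f" {_words(abstract)} "
    found = []
    for source, terms in vocabularies.items():
        for term in terms:
            needle = _words(term)
            # Pad with spaces so only whole-word phrases match.
            if needle and f" {needle} " in padded:
                found.append((term, source))
    return found
```

Called with `{"LCSH": ["Coral reefs"], "MeSH": ["Seawater"]}` and an abstract mentioning both, it returns each term paired with its source, which is the information a color-coded cloud would need.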

Based on Google Web Toolkit; currently available on Google Code

Coming soon: HIVE API, SPARQL, SRU

http://datadryad.org

Metadata Editing, a truly extensible solution: Trident

DukeCore:  Duke U’s DC wireframe

MAP (metadata application profile): works by instructing an editor how to build a UI for editing; creates a schema-neutral representation of the metadata (metadata form).  The editor only needs to understand the metadata form and communicate with the repository via the API

Editor/repository manager app:  built on a web services API so it doesn’t need to know what’s behind it.  Uses Python, Django, Yahoo grids, and jQuery

Uses a RESTful API

Starts in the metadata schema (DukeCore), transforms to mdr; validations are applied to create a packet that is returned to the user interface.  On submission, it goes to mdf and then back to DukeCore

Metadata forms made up of field groups & elements (looks a lot like infopath form elements)
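A small sketch of what a schema-neutral form driven by an application profile might look like. Everything here (the profile layout, field names, and function names) is hypothetical and is not Trident's actual data model; it only illustrates the round trip of record to form, validation, and form back to record.

```python
# Hypothetical application profile: field groups containing elements.
PROFILE = {
    "groups": [
        {
            "label": "Description",
            "elements": [
                {"name": "title", "label": "Title",
                 "required": True, "repeatable": False},
                {"name": "subject", "label": "Subject",
                 "required": False, "repeatable": True},
            ],
        },
    ]
}


def to_form(record, profile):
    """Combine a record (name -> list of values) with the profile into a form."""
    form = {"groups": []}
    for group in profile["groups"]:
        elements = []
        for spec in group["elements"]:
            values = list(record.get(spec["name"], []))
            elements.append({**spec, "values": values})
        form["groups"].append({"label": group["label"], "elements": elements})
    return form


def validate(form):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    for group in form["groups"]:
        for el in group["elements"]:
            if el["required"] and not any(v.strip() for v in el["values"]):
                errors.append(f"{el['label']} is required")
            if not el["repeatable"] and len(el["values"]) > 1:
                errors.append(f"{el['label']} is not repeatable")
    return errors


def from_form(form):
    """Flatten a submitted form back into a record, skipping empty elements."""
    return {
        el["name"]: list(el["values"])
        for group in form["groups"]
        for el in group["elements"]
        if el["values"]
    }
```

The point of the design is that the editor only ever sees the form structure, so swapping the underlying schema means changing the profile, not the editor.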

You can also have your vocabulary lists automatically updated, & built on the fly

Repository architecture:  Repo API allows editor not to have to worry about implementation.  Next level is Fedora business logic & I/O layer, then Fedora repository.

Solr updated in realtime

Uses JMS messaging

Lightning Talks:

Forward: forward.library.wisconsin.edu

Uses Blacklight (shoutout to Blacklight devs).  Shows librarians who are recommended for a targeted search string.

Problems: no standardization in cataloging; differing licensing

Stable (can shoot it with guns & it still runs)

Ruby tool to edit MODS metadata

(using existing MODS metadata)

“Opinionated XML”:  looking for feedback: http://yourmediashelf.org/blog

DAR: Digital Archaeological Record

Trying to allow archeologists to submit data sets with any encoding they want

Includes a google map on the search screen. Advances to filtering page for data results.  Tries to allow others to map their ontology to other, standardized ontologies

Hydra:  Blacklight + ActiveFedora + Rails

Being used by Stanford to process dissertations & theses

Hydrangea for open repositories to be released this year, using MODS & DC

Why CouchDB

JSON documents, uses GET, PUT, DELETE, & POST
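To show how the HTTP verbs map onto documents, here is a sketch that builds (but does not send) requests with the standard library. The database name, document ID, and helper function are made up; `http://localhost:5984` is CouchDB's default address on a local install.

```python
import json
from urllib.request import Request


def couch_request(method, base_url, db, doc_id=None, doc=None, rev=None):
    """Build an HTTP request for a CouchDB document operation (hypothetical helper).

    GET    /db/doc_id          fetch a document
    PUT    /db/doc_id          create or update a document with a known ID
    POST   /db                 create a document, server assigns the ID
    DELETE /db/doc_id?rev=...  delete a specific revision
    """
    url = f"{base_url}/{db}"
    if doc_id is not None:
        url += f"/{doc_id}"
    if rev is not None:
        url += f"?rev={rev}"
    data = json.dumps(doc).encode("utf-8") if doc is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(url, data=data, headers=headers, method=method)
```

Passing the result to `urllib.request.urlopen` would actually send it; everything going in and coming out is JSON.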

Stateless auto replication possible.

Includes CouchApps, which live inside CouchDB (you have to go get them & store them in CouchDB). Sofa, e.g., outputs Atom feeds.
