Code4Lib2010 Notes from Day 3

Keynote #2:  Paul Jones

catfish, cthulhu, code, clouds and levenshtein cloud (what is the levenshtein distance between Cathy Marshall and Paul Jones?)

Brains: create the code using certain assumptions about the recipient, which may or may not be accurate

Brains map based on how they are used.

(images showing difference in brain usage between users who are Internet Naive, and Internet Savvy: for Internet Naive, “reading text” image closely matches “Internet searching” image, but is very different for the Internet Savvy image)

Robin Dunbar: anthropologist who studies apes & monkeys.  Grooming, gossip, and the evolution of language (book).  Neocortex ratio (ability to retain social relationship memory).  “Grooming” in groups maintains relationship memory.  In humans, the relationship can be maintained long distance via communication (“gossip”).  Dunbar circles of intimacy; lower the number, higher the “intimacy” relationship.

Who do you trust and how do you trust them?

Attributed source matters (S.S. Sundar, & C. Nass: “Source effects in users’ perception of online news”).  Preference for source: 1.others,, 3.self, editor; quality preference: 1.others,, editors, 4.self (others=random preset; computer=according to computer generated profile; self=according to psychological profile)

Significance:  small talk leads to big talk, which leads to trust.

Small talk “helps” big talk when there is: 1. likeness (homophily) 2. grooming and gossip 3. peer to peer in an informal setting 4. narrative over instruction

Perception of american nerds problems: 1.ADD 2. asperger (inability to get visual emotional cues) 3.hyperliterality, & jargon of the tasks and games 4. friendless, 5. idiocentric humor(but this gets engineered away?)

but they: 1.multitask, 2. use text based interactions (visual cues become emoticons), 3.mainstream jargon into slang, 4. redefine friendship 5. use the power of Internet memes, shared mindspace

The gossipy part of social interactions around information is what makes it accessible, memorable and actionable.  But it’s the part we strip out.

Lightning Talks

Batch OCR with open source tools

Tesseract (no layout analysis) & Ocropus (includes layout analysis): both in google code (python script builds a PDF file from an image) for code

VuFind at Western Mich.U

Uses marc 005 field to determine new book (now -5 days, now -14 days, etc.)

Please clean my data

Cleaning harvested metadata & cleaning ocr text

Transformation and translation steps added to the harvesting process (all metadata records are in xml); regex as part of step templates; velocity template variables store each regex (gives direct access to the xml elements, then use the java dom4j api to do effectively whatever we wish)

Using newspapers:  example of “fix this text” (within browser).  Cool

Who the heck uses Fedora disseminators anyway?

Fedora content models are just content streams.

(They put their stuff in Drupal.)

Disseminator lives in Fedora, extended with PHP to display in Drupal, & edit

Library a la carte Update

Ajax tool to create library guides, open source style

Built on building blocks (reuse, copy, share)

Every guide type has a portal page (e.g., subject guide, tutorial (new), quiz)

Local install or hosted: tech stack: ruby, gems, rails, database, web server

Open source evangelism:  open source instances in the cloud!

Digital Video Made Easier

Small shop needed to set up video services quickly.  Solution: use online video services for ingest, data store, metadata, file conversion, distribution, video player.

Put 3 demos in place (will be posted on twitter stream): api, youtube api

Disadvantages: data in the cloud, Terms of service, api lag, varying support (youtube supportive, blip not so much)


Tool to help students find physical space more easily.

Launched oct. 2009, approx 65 posts per week.

php + MySql + jQuery

creativecommons licensed

EAD, Apis & Cooliris

Limited by contentdm (sympathy from audience), but tricked out to integrate cooliris.

(Pretty slick)


You Either Surf or You Fight: Integrating Library Services With Google Wave

Why?  go where your users are

Wave apps: gadgets & robots (real time interaction)

Real time interaction

Google has libraries in java and python, deployed using google app engine (google’s cloud computing platform.  Free for up to 1.3 million requests per day)

Create an app at; Get app engine SDK, which includes the app engine launcher (creates skeleton application & deploys into app engine)

Set up the app.yaml (give app a name, version number [really important]).  api version is the api version of app engine.  Handler url is /_wave/.*

Get the wave robot api (not really stable yet, but shouldn’t be a problem) & drop it into the app directory

Wave concepts:  wavelet is the current conversation taking place within google wave; blip is each message part of the wavelet (hierarchical); each (wavelet & blip) have unique identifiers, which can be programmatically addressed.

Code avail on

External libs are ok – just drop in the proj directory.  Using beautifulsoup to scrape the OPAC

Google wave will do html content, sort of.  Use OpBuilder instead (but it is not well documented).  Doesn’t like CSS

For debugging, use logging library (one of the better parts)

Comments are closed.