Posted by Tris Shores

Project Intro

Goal

To develop a cataloging platform that reduces the cost & complexity of original bibliographic cataloging.

Result

An experimental cataloging platform that:

  • Leverages predictive algorithms and crowdsourced metadata to significantly accelerate cataloging.
  • Has an assistive cataloger interface with built-in validation.
  • Supports original, copy-modify, and copy cataloging.
  • Supports minimal-level (optionally with LCSH), pre-publication, and full-level records.
  • Saves the raw bibliographic data entered by catalogers, allowing rapid high-level edits and regeneration of MARC/BIBFRAME records.
  • Is pre-trained and continues to improve its predictive capability as materials are cataloged.


Project Software

The cataloging platform has these components:

PredictiveBIB

An experimental cloud-connected cataloging platform that uses predictive algorithms and crowdsourced metadata to simplify & accelerate creation of bibliographic records. PredictiveBIB’s core innovation is the use of algorithms to process metadata in order to predict subject headings for the item being cataloged, allowing catalogers to add auto-suggested subject headings in seconds. Other algorithms auto-suggest LC genre/form terms, name authorities, title authorities, and publishers whenever possible. Text analytics and other optimizations further simplify cataloging.
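
As a rough illustration of the prediction idea (not PredictiveBIB’s actual implementation), the sketch below ranks candidate subject headings by how strongly their crowdsourced term associations overlap with the metadata entered so far, then weights the result by heading usage frequency. All type and member names are hypothetical.

```csharp
// Minimal sketch of subject heading prediction: correlation with entered metadata,
// weighted by usage frequency. All names are hypothetical, not PredictiveBIB's API.
using System;
using System.Collections.Generic;
using System.Linq;

public record HeadingCandidate(string Heading, double Score);

public static class SubjectHeadingPredictor
{
    // associations: metadata term -> (subject heading -> correlation strength)
    // usage:        subject heading -> usage count across participant libraries
    public static IEnumerable<HeadingCandidate> Suggest(
        IEnumerable<string> metadataTerms,
        IReadOnlyDictionary<string, Dictionary<string, double>> associations,
        IReadOnlyDictionary<string, int> usage,
        int top = 5)
    {
        var scores = new Dictionary<string, double>();
        foreach (var term in metadataTerms.Select(t => t.ToLowerInvariant()))
        {
            if (!associations.TryGetValue(term, out var headings)) continue;
            foreach (var (heading, strength) in headings)
                scores[heading] = scores.GetValueOrDefault(heading) + strength;
        }

        // Frequently used headings get a mild boost so common, well-established
        // headings rank above rarely used ones with similar correlation.
        return scores
            .Select(kv => new HeadingCandidate(
                kv.Key, kv.Value * Math.Log(2 + usage.GetValueOrDefault(kv.Key))))
            .OrderByDescending(c => c.Score)
            .Take(top);
    }
}
```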

Cataloger-interface development started with the premise that catalogers should be freed up to focus solely on metadata, not record formatting. PredictiveBIB’s rich & responsive desktop app interface is considerably abstracted from the record format and looks nothing like a web form or MARC fields/subfields. It dynamically analyzes the entered bibliographic metadata and presents an intuitive shortest path to completion. Validation is built in to help maintain consistent quality, using text analysis techniques such as named-entity recognition, spell-checking, and casing & article checks.
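
The casing & article checks mentioned above might look something like the following sketch; the rules shown are illustrative only and are not PredictiveBIB’s actual validation logic.

```csharp
// Illustrative title checks: flag all-uppercase input and leading English articles
// (which affect nonfiling characters). Not PredictiveBIB's actual rules.
using System;
using System.Collections.Generic;

public static class TitleValidation
{
    private static readonly string[] EnglishArticles = { "the", "a", "an" };

    public static IReadOnlyList<string> Check(string title)
    {
        var issues = new List<string>();
        var trimmed = title.Trim();
        if (trimmed.Length == 0)
        {
            issues.Add("Title is empty.");
            return issues;
        }

        // Casing check: a title copied in all caps from a cover or title page.
        if (trimmed == trimmed.ToUpperInvariant() && trimmed != trimmed.ToLowerInvariant())
            issues.Add("Title appears to be all uppercase; consider sentence casing.");

        // Article check: a leading article changes the filing/nonfiling character count.
        var firstWord = trimmed.Split(' ')[0].ToLowerInvariant();
        if (Array.IndexOf(EnglishArticles, firstWord) >= 0)
            issues.Add($"Title begins with the article \"{firstWord}\"; nonfiling characters may apply.");

        return issues;
    }
}
```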

PredictiveBIB seeks to balance ease of cataloging with bibliographic record completeness. It supports minimal-level (optionally with LCSH), pre-publication, and full-level records. Once a level is selected, the cataloger is prompted to enter only the essential bibliographic data. To shorten development time, some materials, such as biographies and textbooks, are not yet supported.
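
As a hypothetical sketch of the level-based prompting (field names and groupings are assumptions, not PredictiveBIB’s actual schema):

```csharp
// Hypothetical record levels and the essential fields a cataloger might be prompted for.
public enum RecordLevel { Minimal, MinimalWithLcsh, PrePublication, Full }

public static class EssentialFields
{
    public static string[] For(RecordLevel level) => level switch
    {
        RecordLevel.Minimal         => new[] { "Title", "Creator", "Publisher", "Date", "ISBN" },
        RecordLevel.MinimalWithLcsh => new[] { "Title", "Creator", "Publisher", "Date", "ISBN", "SubjectHeadings" },
        RecordLevel.PrePublication  => new[] { "Title", "Creator", "Publisher", "ProjectedDate", "ISBN" },
        RecordLevel.Full            => new[] { "Title", "Creator", "Publisher", "Date", "ISBN", "Extent",
                                               "SubjectHeadings", "GenreFormTerms", "Notes" },
        _                           => System.Array.Empty<string>()
    };
}
```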

To get an idea of PredictiveBIB’s capabilities, review the sample records. These were generated by PredictiveBIB from bibliographic metadata entered in the app. Included is a diverse selection of English-language, bilingual, multilingual, and foreign-language books (thanks to Google Translate).

The only subsequent modification to the sample records was assignment of an artificial creation date to facilitate regression testing; ModMARC was not used on any of them. All of the records were created by cataloging books borrowed from public lending libraries. Every effort was made to avoid proprietary metadata & databases; FAST subject headings were not used even though they are good candidates for prediction and could readily be supported.

The quickest way to learn PredictiveBIB is to open the sample records in the desktop app component and see how the book metadata is entered on the various app pages. Any sample record can be loaded into PredictiveBIB by typing the first letter of its title in a field on the app’s start page and selecting from an auto-suggest list. One more mouse-click generates the MARC21/BIBFRAME record, and one further click opens it in your text editor, ViewMARC, or ModMARC.

PredictiveBIB generates MARC (.mrc), MARC XML (.xml), human-readable MARC (.txt), and BIBFRAME (.rdf.xml) records in the cloud, then saves them locally to the cataloger’s computer (or a network share). PredictiveBIB also generates an intermediate data file (.metaxml) containing all cataloger-supplied bibliographic metadata, so the cataloger can later modify that metadata in PredictiveBIB and regenerate the bibliographic records. This approach allows libraries to rapidly edit or enrich previously created records.
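
For example, a library could script a high-level edit against the saved intermediate file and then regenerate the records in PredictiveBIB. The element names and file path below are assumptions; the actual .metaxml schema is internal to PredictiveBIB.

```csharp
// Hypothetical sketch: tweak cataloger-supplied metadata in a saved .metaxml file,
// then reopen it in PredictiveBIB to regenerate the MARC/BIBFRAME records.
using System;
using System.Xml.Linq;

class RegenerateExample
{
    static void Main()
    {
        var path = @"C:\Cataloging\sample-title.metaxml";   // assumed local save location
        var doc = XDocument.Load(path);

        // Example high-level edit: correct the publisher name (element name is assumed).
        var publisher = doc.Root?.Element("Publisher");
        if (publisher != null)
            publisher.Value = "Corrected Publisher Name";

        doc.Save(path);
        Console.WriteLine("Metadata updated; regenerate the records from PredictiveBIB.");
    }
}
```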

Generated MARC/BIBFRAME records are by default not saved in the cloud and are private to the authoring library. Optionally, PredictiveBIB supports addition of a CC0 statement and upload to a public-domain repository (CRESS).

Cloud Predictive Algorithms

PredictiveBIB algorithms consume bibliographic metadata entered by catalogers during use of the software, making the metadata supply self-sustaining. The platform is pre-trained on the books cataloged to date, and predictive capability will expand into other subject areas as community cataloging progresses in those areas. Libraries intending to use PredictiveBIB will need to consent to the collection & use of all entered bibliographic metadata. The bibliographic data extracted through data-mining is anonymous, unless you happen to be the book’s author, contributor, publisher, etc. As noted above, generated MARC/BIBFRAME records are by default not saved in the cloud and remain private to the authoring library, although libraries can elect to upload them to a connected repository.

PredictiveBIB uses several algorithms to extract & analyze bibliographic metadata entered by the user community in order to predict subject headings (a simplified sketch follows the list below). The algorithms:

  • Use word, synonym, and relevance analysis to assess & rank correlation strength between bibliographic metadata and subject headings.
  • Assign weightings to subject headings based on usage frequency.
  • Act on different sets of subject headings, for example:
    • Subject headings used by a library.
    • Subject headings used by all participant libraries.
    • Subject headings referenced from 4XX and 5XX fields within subject heading authority records.
  • Act on different sets of bibliographic metadata, for example:
    • Metadata entered by a library.
    • Metadata entered by all participant libraries.
    • Metadata from various sources, whether crowdsourced or archived, extracted from bibliographic records or otherwise.
  • Execute on different triggers:
    • Event triggers, such as when subject heading selection is required to complete a bibliographic record, or on creation of bibliographic records.
    • Periodic triggers, such as a daily update of subject heading usage statistics for all participant libraries.
  • Interact with datasets of associations that are used to predict subject headings for new materials being cataloged.
  • Are independent of bibliographic record format, material-type cataloged, language, subject area, or subject heading authority type.
  • Can be implemented as part of a desktop cataloging tool, browser-based cataloging tool, REST API, or web-service, either integrated with other cataloging functionality or as a dedicated prediction tool.
  • Use linked data concepts, such as linking to subject heading authority narrower/broader terms to widen the association of subject headings with bibliographic metadata.
  • Are able to learn from cataloger-supplied metadata, similar to machine learning algorithms.
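
A simplified sketch of how the association dataset could grow as records are created (data structures and names are hypothetical): each metadata term entered for an item is linked a little more strongly to the subject headings the cataloger selected, and each heading’s usage count is incremented for the periodic statistics update.

```csharp
// Hypothetical association store: learns term-to-heading correlations and heading
// usage counts from each newly cataloged record.
using System.Collections.Generic;
using System.Linq;

public class AssociationStore
{
    // metadata term -> (subject heading -> correlation strength)
    public Dictionary<string, Dictionary<string, double>> Associations { get; } = new();

    // subject heading -> usage count across participant libraries
    public Dictionary<string, int> Usage { get; } = new();

    public void LearnFromRecord(IEnumerable<string> metadataTerms, IEnumerable<string> chosenHeadings)
    {
        var terms = metadataTerms.Select(t => t.ToLowerInvariant()).ToList();

        foreach (var heading in chosenHeadings)
        {
            Usage[heading] = Usage.GetValueOrDefault(heading) + 1;

            foreach (var term in terms)
            {
                if (!Associations.TryGetValue(term, out var headings))
                    Associations[term] = headings = new Dictionary<string, double>();

                headings[heading] = headings.GetValueOrDefault(heading) + 1.0;
            }
        }
    }
}
```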

As more materials are cataloged using PredictiveBIB, the associations are expanded and refined. Periodically, the associations are reviewed & adjusted by a human, because automated language processing is by no means infallible.

Cloud Repository & Exchange Services

Cloud Repository & Exchange Services (CRESS) is an experimental public-domain cloud repository available to PredictiveBIB users who elect to search, import, or export public-domain bibliographic records.

Libraries are welcome to host mirrors of CRESS bibliographic content to ensure data replication, distributed access, and non-monopolization. PredictiveBIB can be adapted to connect to any online repository that does not impose bibliographic metadata usage restrictions that would limit predictive functionality or innovation.

ViewMARC

Experimental desktop cataloging software that allows in-depth inspection of MARC21 fields. A very simple-to-use (educational) tool for both copy and original catalogers. This tool is integrated into the PredictiveBIB desktop app.
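
As a rough illustration of what MARC21 field inspection involves (not ViewMARC’s own code), the sketch below walks the fields of a generated MARC XML (.xml) record using the standard MARC21 slim namespace.

```csharp
// Rough illustration: print the leader, control fields, and data fields (with
// indicators and subfields) of a MARC XML record. Not ViewMARC's implementation.
using System;
using System.Linq;
using System.Xml.Linq;

class InspectMarcXml
{
    static void Main(string[] args)
    {
        XNamespace marc = "http://www.loc.gov/MARC21/slim";
        var doc = XDocument.Load(args[0]);   // path to a generated .xml record

        foreach (var record in doc.Descendants(marc + "record"))
        {
            Console.WriteLine($"LDR  {record.Element(marc + "leader")?.Value}");

            foreach (var field in record.Elements(marc + "controlfield"))
                Console.WriteLine($"{field.Attribute("tag")?.Value}  {field.Value}");

            foreach (var field in record.Elements(marc + "datafield"))
            {
                var subfields = string.Join(" ", field.Elements(marc + "subfield")
                    .Select(s => $"${s.Attribute("code")?.Value} {s.Value}"));
                Console.WriteLine($"{field.Attribute("tag")?.Value} " +
                                  $"{field.Attribute("ind1")?.Value}{field.Attribute("ind2")?.Value} {subfields}");
            }
        }
    }
}
```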

ModMARC

Experimental desktop cataloging software that allows modification of individual MARC21 records. A simple but versatile tool for original catalogers. This tool is integrated into the PredictiveBIB desktop app.


Beta Testing

PredictiveBIB is available for beta testing at organizations such as libraries, museums, and historical societies.


Collaborative Cataloging

PredictiveBIB supports remote collaboration between catalogers. For example, an onsite cataloger enters physical book properties and a remote cataloger completes the record. Inter-cataloger notes & hyperlinks are automatically imported with each record.


Updates

  • Added support for BIBFRAME records using the Library of Congress marc2bibframe2 utility and a .NET XSLT processor (a minimal wiring sketch follows this list).
  • Added support for MARC accessibility fields to improve discoverability of accessibility information.
  • Presented PredictiveBIB at the American Library Association Core Interest Group Virtual Week, 2021.
  • Added support for pre-publication bibliographic records.
  • Added support for accelerated e-book cataloging.
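
A minimal sketch of the BIBFRAME wiring mentioned in the first update above, using the .NET XSLT processor (System.Xml.Xsl.XslCompiledTransform) to run the marc2bibframe2 stylesheets against a MARCXML record. The stylesheet path and the baseuri parameter are assumptions; check the marc2bibframe2 documentation for the actual entry stylesheet and parameters.

```csharp
// Hedged sketch: convert a MARCXML record to BIBFRAME RDF/XML with the Library of
// Congress marc2bibframe2 stylesheets and the .NET XSLT processor. Paths and
// parameter names are assumptions, not verified against the utility's docs.
using System.Xml;
using System.Xml.Xsl;

class MarcToBibframe
{
    static void Main()
    {
        var xslt = new XslCompiledTransform();

        // Enable document() so the stylesheet can load any lookup tables it references (assumption).
        var settings = new XsltSettings(enableDocumentFunction: true, enableScript: false);
        xslt.Load(@"marc2bibframe2\xsl\marc2bibframe2.xsl", settings, new XmlUrlResolver());

        var xsltArgs = new XsltArgumentList();
        xsltArgs.AddParam("baseuri", "", "http://example.org/");   // illustrative base URI parameter

        using var output = XmlWriter.Create("record.rdf.xml", new XmlWriterSettings { Indent = true });
        xslt.Transform("record.xml", xsltArgs, output);
    }
}
```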


Source Code

The project software contains no third-party libraries other than official Microsoft packages and the marc2bibframe2 utility published by the Library of Congress.


Future Work

  • Fix shortcomings identified during beta testing.

  • Support native linked data cataloging to generate RDF/XML records.

  • Add support for audiobooks, DVDs, textbooks, braille, and biographies, plus other materials based on demand.

  • Add support for Children’s Subject Headings (CSH).

  • Continue algorithm R&D.

  • Apply predictive algorithms to a readers’ advisory service.

  • Add further support for import of ONIX data. Currently, only basic ONIX fields are supported.


Acknowledgments

Thanks are owed to many:

  • To Alisha Taylor for setting aside valuable weekend time to teach me the art of cataloging and for collaborating on research for the project.
  • To Creative Commons for public-domain tools.
  • To staff at the Library of Congress for answering critical technical questions.
  • To local public lending libraries for providing books, even during COVID lockdown.
  • To the Library of Congress for their online documentation and tutorials.
  • To Emory LaPrade for ‘Accurately Representing Diverse Human Experiences in a Rapidly Changing Vernacular’, an ALA Core presentation that inspired unofficial subject heading prediction/tracking.
  • To Teressa Keenan for ‘Harnessing the Power of MARC Data to Improve Discoverability of Accessible Materials’, an ALA Core presentation that inspired improved accessibility support.
  • To the University of Illinois at Urbana-Champaign for their open-source Metadata Maker, which hints at the possibilities of abstraction from MARC.
  • To librarians who shared their insights during a phone survey on library public-domain practices.
  • To the author of The Cataloging Calculator for permission to add a link in PredictiveBIB to that valuable resource.
  • To the author of MarcEdit for answering a technical question related to MARC validation.
  • To Microsoft for providing the open-source .NET developer platform.