Posted by Tris S. on 12/14/20 (updated 01/17/21)

Update 01/14/21: Linked data support is planned for mid 2021.

Update 01/10/21: To help libraries cope with a blended onsite & remote workforce, PredictiveBIB now supports collaborative cataloging.

Update 01/02/21: Pre-publication bibliographic records are now available for download, with more on the way.

Update 12/27/20: PredictiveBIB now supports accelerated e-book cataloging.

Project Intro

From January 4th, 2021 onwards, libraries may participate in beta testing of experimental MARC cataloging software that reduces the cost & complexity of bibliographic cataloging within public lending libraries, through a cataloging platform that:

  • Leverages predictive algorithms and crowdsourced metadata to significantly accelerate cataloging.
  • Has a highly assistive cataloger interface with built-in validation.
  • Supports original, copy-modify, and copy cataloging.
  • Supports minimal-level (optionally with LCSH), pre-publication, and full-level records.
  • Saves the raw bibliographic data entered by catalogers, allowing rapid high-level edits and regeneration of MARC records.
  • Is pre-trained and continues to improve its predictive capability as materials are cataloged.


Project Software

The experimental MARC cataloging software consists of these components:

PredictiveBIB

An experimental cloud-connected software cataloging platform that uses predictive algorithms and crowdsourced metadata to simplify & accelerate creation of MARC21 bibliographic records. PredictiveBIB’s core innovation is the use of algorithms that process metadata to predict subject headings for the item being cataloged, allowing catalogers to add auto-suggested subject headings in seconds. Various other algorithms are used to auto-suggest LC genre/form terms, name authorities, title authorities, publishers, and DDC call numbers whenever possible. Text analytics and other optimizations further simplify cataloging.
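
To make the idea concrete, here is a minimal sketch, in Python and purely illustrative (not the production implementation), of how crowdsourced associations might be turned into ranked subject heading suggestions; the association table, headings, and weights are invented for the example.

  # Toy illustration of metadata-driven subject heading suggestion.
  # The association table and weights are invented; the real platform
  # derives them from crowdsourced cataloging metadata in the cloud.
  from collections import defaultdict

  # term -> {subject heading: association strength}
  ASSOCIATIONS = {
      "gardening":  {"Gardening": 0.9, "Vegetable gardening": 0.6},
      "vegetables": {"Vegetable gardening": 0.8, "Vegetables": 0.7},
      "organic":    {"Organic gardening": 0.9},
  }

  def suggest_headings(metadata_terms, usage_weight=None, top_n=5):
      """Rank candidate subject headings for terms entered by a cataloger."""
      usage_weight = usage_weight or {}   # heading -> usage-frequency weighting
      scores = defaultdict(float)
      for term in metadata_terms:
          for heading, strength in ASSOCIATIONS.get(term.lower(), {}).items():
              # correlation strength, boosted by how often the heading is used
              scores[heading] += strength * (1.0 + usage_weight.get(heading, 0.0))
      return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]

  print(suggest_headings(["Organic", "vegetables", "gardening"]))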

Cataloger interface development started with the premise that catalogers should be free to focus solely on metadata, not record formatting. PredictiveBIB’s rich & responsive desktop app interface is considerably abstracted from the record format and looks nothing like a web form or MARC fields/subfields. It dynamically analyzes entered bibliographic metadata and delivers an intuitive shortest path to completion. Validation is built in to help maintain consistent quality, using text analysis techniques such as named-entity recognition, spell checking, and casing & article checks.
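
As one example of the kind of built-in check described above, the short sketch below (an illustration only, not the app’s actual validation code; the function is hypothetical) flags an all-caps title and detects a leading English article, returning the nonfiling character count that a MARC 245 second indicator would record.

  # Illustrative title checks: leading-article (nonfiling) detection and a
  # simple casing warning. Not the actual PredictiveBIB validation code.
  ENGLISH_ARTICLES = ("the ", "a ", "an ")

  def check_title(title):
      """Return (nonfiling_count, warnings) for an English-language title."""
      warnings = []
      nonfiling = 0
      lowered = title.lower()
      for article in ENGLISH_ARTICLES:
          if lowered.startswith(article):
              nonfiling = len(article)      # article plus the following space
              break
      if title.isupper():
          warnings.append("Title is all upper case; check capitalization.")
      return nonfiling, warnings

  print(check_title("The Art of Cataloging"))   # -> (4, [])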

PredictiveBIB seeks to balance ease of cataloging and bibliographic record completeness. It supports minimal-level records (optionally with LCSH) and full-level records. Once a level is selected, the cataloger is prompted to enter only essential bibliographic data. To shorten development time, some materials, such as biographies and textbooks, are not yet supported.

To get an idea of PredictiveBIB capabilities, review the sample records. These were generated by PredictiveBIB from bibliographic metadata entered in the app. Included is a diverse selection of English language, bilingual, multilingual, and foreign language books (thanks to Google Translate).

The only subsequent modification to the sample records was assignment of an artificial creation date to facilitate regression testing. ModMARC was not used on any of the sample records, although it could be used to tweak records. All the sample records were created through cataloging of books borrowed from public lending libraries. Every effort was made to avoid proprietary metadata & databases; FAST subject headings were not used even though they are good candidates for prediction and could readily be supported.

The quickest way to learn PredictiveBIB is to open sample records in the desktop app component and see how the book metadata is entered on the various app pages. Any of the sample records can be loaded into PredictiveBIB by typing the first letter of a title in a field on the app’s start page and selecting from an auto-suggest list; the MARC21 record is then generated with one more mouse click, and can be viewed in your text editor, ViewMARC, or ModMARC with one further click.

PredictiveBIB generates MARC (.mrc), MARC XML (.xml), and human-readable MARC (.txt) records in the cloud, then saves them locally to the cataloger’s computer (or network share). PredictiveBIB additionally generates an intermediate data file (.metaxml) containing all cataloger-supplied bibliographic metadata, so the cataloger can modify metadata in PredictiveBIB and regenerate the bibliographic records. This approach lets catalogers locally retain both the bibliographic records and the raw data from which they are constructed. Library bibliographic records are by default not retained in the cloud and remain private to the authoring library.
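
The layout of the .metaxml file is not documented here, so the sketch below assumes a simple hypothetical flat structure; it only illustrates the concept above, namely that records can be re-emitted from the retained raw metadata (the field choices and helper are invented).

  # Sketch of regenerating a human-readable (.txt style) fragment from raw
  # metadata. The .metaxml layout shown here is hypothetical; only the
  # edit-raw-metadata-then-regenerate idea mirrors the description above.
  import xml.etree.ElementTree as ET

  SAMPLE_METAXML = """<record>
    <title>Organic vegetable gardening</title>
    <author>Doe, Jane</author>
    <publisher>Example Press</publisher>
    <year>2020</year>
  </record>"""

  def regenerate_readable_marc(metaxml_text):
      meta = ET.fromstring(metaxml_text)
      get = lambda tag: meta.findtext(tag, default="")
      # A real record would also carry the leader, fixed fields, and many
      # more variable fields; this emits just a tiny illustrative fragment.
      return "\n".join([
          f"100 1  $a {get('author')}",
          f"245 10 $a {get('title')}",
          f"264  1 $b {get('publisher')}, $c {get('year')}",
      ])

  print(regenerate_readable_marc(SAMPLE_METAXML))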

Libraries can easily add a CC0 statement to bibliographic records they create using PredictiveBIB, but doing so is optional.

Cloud Predictive Algorithms

PredictiveBIB algorithms consume bibliographic metadata entered by catalogers during use of the software, making the metadata supply self-sustaining. The platform is pre-trained on books cataloged to date, and predictive capability will expand into other subject areas as community cataloging progresses in those areas. Libraries intending to use PredictiveBIB will need to consent to the collection & use of all entered bibliographic metadata.

PredictiveBIB uses several algorithms to analyze bibliographic metadata entered by the user community in order to predict subject headings. The algorithms:

  • Use word, synonym, and relevance analysis to assess & rank correlation strength between bibliographic metadata and subject headings.
  • Assign weightings to subject headings based on usage frequency.
  • Act on different sets of subject headings, for example:
    • Subject headings used by a library.
    • Subject headings used by all participant libraries.
    • Subject headings referenced via 4XX and 5XX fields within subject heading authority records.
  • Act on different sets of bibliographic metadata, for example:
    • Metadata entered by a library.
    • Metadata entered by all participant libraries.
  • Execute on different triggers:
    • Event triggers, such as when subject heading selection is required to complete a bibliographic record, or on creation of bibliographic records.
    • Periodic triggers, such as a daily update of subject heading usage statistics for all participant libraries.
  • Interact with datasets of associations that are used to predict subject headings for new materials being cataloged.
  • Are independent of bibliographic record format, material type cataloged, language, subject area, or subject heading authority type.
  • Use linked data concepts, such as linking to subject heading authority narrower/broader terms to widen the association of subject headings with bibliographic metadata.
  • May use machine learning to automate processing of community-crowdsourced metadata to build correlations between bibliographic metadata and subject headings.

As more materials are cataloged using PredictiveBIB, the associations are expanded and refined. Periodically, the associations are reviewed & adjusted by a human, because automated language processing is by no means infallible. This may fall under the category of ‘semi-supervised machine learning’.
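
A rough sketch of that refinement step, with invented data and structures, might look like the following: each newly cataloged record adds to term/heading co-occurrence counts and to the usage-frequency weights mentioned in the list above.

  # Toy sketch of refining term <-> subject heading associations as new
  # records are cataloged (data and structures are invented for illustration).
  from collections import Counter, defaultdict

  def update_associations(associations, heading_usage, new_records):
      """new_records: iterable of (metadata_terms, assigned_headings) pairs."""
      for terms, headings in new_records:
          for heading in headings:
              heading_usage[heading] += 1                   # usage-frequency weight
              for term in terms:
                  associations[term.lower()][heading] += 1  # co-occurrence count
      return associations, heading_usage

  associations = defaultdict(Counter)
  heading_usage = Counter()
  update_associations(associations, heading_usage, [
      (["organic", "vegetables"], ["Vegetable gardening", "Organic gardening"]),
      (["vegetables", "cookery"], ["Cooking (Vegetables)"]),
  ])
  print(associations["vegetables"].most_common(2))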

Cloud Repository & Exchange Services (CRESS)

CRESS is an experimental cloud repository available to PredictiveBIB users who want to search and import public-domain MARC bibliographic records. Libraries may optionally choose to upload public-domain bibliographic records created using PredictiveBIB to CRESS.

ViewMARC

Experimental desktop cataloging software that allows in-depth inspection of MARC21 fields. A very simple-to-use (educational) tool for both copy and original catalogers. This tool is integrated into the PredictiveBIB desktop app.

ModMARC

Experimental desktop cataloging software that allows modification of individual MARC21 records. A simple but versatile tool for original catalogers. This tool is integrated into the PredictiveBIB desktop app.


Project Funding

No funding, public or otherwise, has been sought or received for any part of the Project.

For this project to evolve beyond pilot testing, a revenue stream is needed that covers operational costs without compromising the mission.


Beta Testing

PredictiveBIB is available for beta testing from January 4th, 2021 onwards.


Collaborative Cataloging

Catalogers may collaborate (remotely) with each other or with the Project to create & enrich records in PredictiveBIB. For example, an onsite cataloger enters physical book properties and a remote cataloger completes the remaining fields. Inter-cataloger notes & hyperlinks (e.g. to a CIP data image in a cloud-storage folder) are automatically imported with each record.
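
A minimal sketch of how two partial drafts could be combined, assuming invented field names and a simple notes list, is shown below; it is only meant to illustrate the split of work described above.

  # Illustrative merge of two partial record drafts from collaborating
  # catalogers; the field names and note structure are invented.
  def merge_drafts(onsite, remote):
      merged = dict(onsite)                 # physical properties entered onsite
      for field, value in remote.items():
          merged.setdefault(field, value)   # remote cataloger fills the rest
      merged["notes"] = onsite.get("notes", []) + remote.get("notes", [])
      return merged

  onsite = {"extent": "248 pages", "dimensions": "24 cm",
            "notes": ["CIP image: https://example.org/cip/12345.png"]}
  remote = {"title": "Organic vegetable gardening",
            "subjects": ["Vegetable gardening"],
            "notes": ["Checked LCSH headings."]}
  print(merge_drafts(onsite, remote))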


Source Code

Since the project is not publicly funded and very new, I plan to take a considered approach to open-sourcing. Due to the size & complexity of the Project software, and the expertise needed to run the distributed cataloging platform, open-sourcing the software is unlikely to benefit public lending libraries. Also, closed-source software has more commercial licensing potential, which could subsidize ongoing research and public lending library usage.

Project release software contains no third-party libraries outside of official Microsoft library packages.


Future Work

  • Fix any shortcomings identified during beta testing.

  • Support linked data, including all of the following:
    • Enhancement of MARC bibliographic records with linked data URIs.
    • Automated conversion of MARC bibliographic records to linked data RDF/XML records.
    • Cataloging using a combination of linked data URIs and MARC authority records.
    • Native generation of linked data RDF/XML records.

  • Add support for audiobooks, and for other materials based on demand.

  • Continue algorithm R&D.

  • Apply predictive algorithms to a readers’ advisory service.

  • Add further support for import of ONIX data; currently, only basic ONIX fields are supported (a minimal extraction sketch follows this list).
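
Which ONIX fields count as “basic” is not spelled out above, so the sketch below is only illustrative: it pulls an ISBN-13, title, and contributor from a simplified ONIX 3.0 product record (a real feed also carries the ONIX namespace and many more composites).

  # Illustrative extraction of a few common ONIX 3.0 fields (ISBN-13, title,
  # contributor). Simplified: a real feed uses the ONIX namespace and far
  # more composites than shown here.
  import xml.etree.ElementTree as ET

  SAMPLE_ONIX = """<Product>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType><IDValue>9781234567897</IDValue>
    </ProductIdentifier>
    <DescriptiveDetail>
      <TitleDetail><TitleElement><TitleText>Organic Vegetable Gardening</TitleText></TitleElement></TitleDetail>
      <Contributor><ContributorRole>A01</ContributorRole><PersonName>Jane Doe</PersonName></Contributor>
    </DescriptiveDetail>
  </Product>"""

  def basic_onix_fields(product_xml):
      product = ET.fromstring(product_xml)
      fields = {}
      for ident in product.findall("ProductIdentifier"):
          if ident.findtext("ProductIDType") == "15":       # 15 = ISBN-13
              fields["isbn13"] = ident.findtext("IDValue")
      fields["title"] = product.findtext(
          "DescriptiveDetail/TitleDetail/TitleElement/TitleText")
      fields["author"] = product.findtext("DescriptiveDetail/Contributor/PersonName")
      return fields

  print(basic_onix_fields(SAMPLE_ONIX))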


Support

Support requests can be sent to project@predictivebib.org.


Acknowledgments

Thanks are owed to many:

  • To a cataloger friend for generously supporting my eccentric approach to learning MARC, and setting aside valuable weekend time to teach me the art of cataloging.
  • To Creative Commons for public-domain advocacy & tools.
  • To staff at the Library of Congress for answering critical technical questions.
  • To local public lending libraries for providing books, even during COVID lockdown.
  • To the Library of Congress for their online documentation and tutorials.
  • To the University of Illinois at Urbana-Champaign for their open-source Metadata Maker, which hints at the possibilities of abstraction from MARC.
  • To librarians who shared their insights during a phone survey on library public-domain practices.
  • To the author of MarcEdit for answering a technical question related to MARC validation.
  • To OCLC for their online documentation.
  • To Microsoft for providing the open-source .NET developer platform.