  • Publication
    The library of Daniel Garrison Brinton
    (University of Pennsylvania. Museum of Archaeology and Anthropology.) Weeks; Weeks, John M.
    This book includes an introductory essay and item-by-item catalog for the personal library of Daniel Garrison Brinton. First a professor of Ethnology and Anthropology at the Academy of Natural Sciences, then later Penn’s first Professor of Archaeology and Linguistics, Brinton (1837-1899) is considered one of the fathers of American anthropology. He built a substantial personal library of over 4000 items, mainly focused on the languages and culture of indigenous peoples of North and Central America. It included several early-modern manuscripts, modern transcriptions, and early printed books from the estate of ethnologist and linguistic scholar, Carl Hermann Berendt (1817-1878), and from the Mayanist, Charles Etienne Brasseur de Bourbourg (1814-1874). After his death, Brinton’s library passed to the Penn Museum for the creation of the Penn Museum Library in 1900.
  • Publication
    Sociotechnical Automation Science: A Case Study in Developing and Augmenting an Ensemble Neural Network with Multiple LLMs for Subject Cataloging at the Penn Libraries
    (2024-06-26) Hahn, Jim
    The sociotechnical aspects of automation play a crucial role in the development of machine learning systems. Through deep collaboration with cataloging professionals at the Penn Libraries, we have created a set of subject indexing algorithms that are ensembled into a neural network. Librarians have evaluated multiple rounds of the algorithm outputs. By identifying the failure points in the neural network-based subject assignment process, we incorporated LLM tasks such as evaluating search result relevance, summarizing search results, and assessing topical assignments of synthetic summaries. Implementing LLM tasks draws on the linguistic strengths of LLMs, rather than world knowledge. The data processing is integrated into an Apache Airflow pipeline, allowing librarians to input an Excel file, which begins the workflow for generating candidate subject descriptions. These machine learning outputs are poised for a pilot test in production systems this summer.
  • Publication
    FAIR Assessment Checklist for Data Repositories
    (University of Pennsylvania, 2024-01-25) Phegley, Lauren
    This assessment checklist is intended to support data repository managers who want to evaluate their repositories' FAIR-enabling practices. The checklist is a guide to evaluating current implementation and planning future actions to make a repository FAIR-enabling. Its intent is to allow honest evaluation of concrete ways to be FAIR-enabling, rather than to admonish for lack of adoption.
  • Publication
    Data Dictionary Blank Template
    (2023-10) Phegley, Lauren
    This is a blank data dictionary template intended to assist researchers with documenting the variables, structure, content, and layout of their datasets. A good data dictionary has enough information about each variable for it to be self-explanatory and interpreted properly by someone outside of the original research group. The data dictionary is available in two file types: an Excel file (.xlsx) and a .csv file. The Excel file has the template and the field descriptions on separate sheets, while the .csv version splits the template and field descriptions into two .csv files, because the .csv format does not allow multiple sheets in one file. The template section provides commonly required columns that are necessary to fully define your data. The field descriptions section is where you define the column headers and the possible values that can be entered. The first row contains an example that can be deleted before you enter your own data. This template is built on the Ag Data Commons "Data Dictionary - Blank Template" from the United States Department of Agriculture (no longer accessible online as of 2023-12-18).
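    The two-file .csv layout described above can be sketched as follows. This is only an illustration: the column headers, their definitions, and the example variable are hypothetical and are not taken from the actual template.

```python
import csv
import io

# Hypothetical column headers; the actual template's columns may differ.
TEMPLATE_COLUMNS = ["variable_name", "description", "data_type", "units", "allowed_values"]

# Hypothetical definitions for each column header (the "field descriptions").
FIELD_DESCRIPTIONS = {
    "variable_name": "Column header exactly as it appears in the dataset",
    "description": "Plain-language meaning of the variable",
    "data_type": "Storage type such as integer, text, or date",
    "units": "Unit of measurement, left blank when not applicable",
    "allowed_values": "Permitted range or list of codes",
}

# Hypothetical example row, playing the role of the deletable first row.
EXAMPLE_VARIABLE = {
    "variable_name": "plot_id",
    "description": "Unique identifier for each field plot",
    "data_type": "integer",
    "units": "",
    "allowed_values": "1-500",
}


def build_csv_pair(variables):
    """Return (template_csv, field_descriptions_csv) as two separate CSV strings.

    Two outputs are needed because a single CSV file cannot hold multiple sheets.
    """
    template = io.StringIO()
    writer = csv.DictWriter(template, fieldnames=TEMPLATE_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(variables)

    descriptions = io.StringIO()
    desc_writer = csv.writer(descriptions, lineterminator="\n")
    desc_writer.writerow(["field", "definition"])
    for column in TEMPLATE_COLUMNS:
        desc_writer.writerow([column, FIELD_DESCRIPTIONS[column]])

    return template.getvalue(), descriptions.getvalue()


template_csv, descriptions_csv = build_csv_pair([EXAMPLE_VARIABLE])
print(template_csv)
print(descriptions_csv)
```

    In an Excel workbook the same two tables would sit on separate sheets of one file.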
  • Publication
    Audiovisual Data Curation Primer Presentation
    (2023-12-14) Phegley, Lauren
    This presentation was given as part of the Data Curation Network's Primer Webinar held on 2023-12-14. The authors presented highlights of their Audiovisual Data Curation Primer, a concise, peer-reviewed resource designed to support data curators in learning about audiovisual files. The full primer is openly available at
  • Publication
    BIBFRAME instance mining: Toward authoritative publisher entities using association rules
    (2020-11-25) Hahn, Jim
    With the transition of a shared catalog to BIBFRAME linked data, there is now a pressing need to identify the canonical Instance for clustering in BIBFRAME. Authoritative publisher entities are a fundamental component of Instance identification. Previous work in this area by OCLC Research (Connaway & Dickey, 2011) proposed a data mining approach for developing an experimental Publisher Name Authority File (PNAF). After data mining and clustering of publishers, the OCLC researchers were able to create profiles for "high-incidence" publishers. As a component of PNAF, Connaway & Dickey were also able to provide detailed subject analysis of publishers. This presentation will detail a case study of machine learning methods that use a corpus of subjects, main entries, and added entries as antecedents in association rules to derive consequent publisher entities. The departure point for the present research into identifying authoritative publisher entities is clustering, reconciliation, and re-use of ISBN and subfield b of MARC 260, along with the subjects (650 - Subject Added Entry), main entries (1XX - Main Entries), and added entries (710 - Added Entry-Corporate Name), as signals to inform a training corpus for association rule mining, among other machine learning algorithms, libraries, and methods.
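    The antecedent-to-consequent idea in this abstract can be illustrated with a minimal sketch. It is not the presenter's implementation: the records, signal labels, publisher names, thresholds, and the function name `mine_publisher_rules` are all hypothetical, and a brute-force count stands in for whatever rule-mining library was actually used.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each record pairs a set of MARC-derived signals
# (650 subjects, 1XX main entries, 710 added entries) with a publisher.
RECORDS = [
    ({"650:Botany", "710:Example Botanical Society"}, "Publisher A"),
    ({"650:Botany", "710:Example Botanical Society"}, "Publisher A"),
    ({"650:Botany", "1XX:Doe, Jane"}, "Publisher B"),
    ({"650:Linguistics", "1XX:Doe, Jane"}, "Publisher B"),
    ({"650:Linguistics", "710:Example Botanical Society"}, "Publisher A"),
]


def mine_publisher_rules(records, min_support=0.4, min_confidence=0.8, max_size=2):
    """Return rules (antecedent, publisher, support, confidence).

    support    = share of all records containing the antecedent and the publisher
    confidence = share of records containing the antecedent that have the publisher
    """
    total = len(records)
    antecedent_counts = Counter()
    rule_counts = Counter()
    for signals, publisher in records:
        items = sorted(signals)
        # Enumerate every antecedent itemset up to max_size signals.
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                antecedent_counts[combo] += 1
                rule_counts[(combo, publisher)] += 1

    rules = []
    for (combo, publisher), both in rule_counts.items():
        support = both / total
        confidence = both / antecedent_counts[combo]
        if support >= min_support and confidence >= min_confidence:
            rules.append((combo, publisher, support, confidence))
    # Strongest rules first: by confidence, then support, then antecedent.
    rules.sort(key=lambda rule: (-rule[3], -rule[2], rule[0]))
    return rules


for antecedent, publisher, support, confidence in mine_publisher_rules(RECORDS):
    print(f"{set(antecedent)} -> {publisher} (support={support:.2f}, confidence={confidence:.2f})")
```

    Exhaustive enumeration like this is only practical for tiny corpora; at catalog scale an Apriori- or FP-growth-style algorithm would be the usual choice.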
  • Publication
    BF Interlingua: Interoperability among BIBFRAME linked data vocabularies
    (2023-01-19) Hahn, Jim
    Presentation exploring an interchange process among BIBFRAME linked data vocabularies.
  • Publication
    SVDE model interoperability: SVDE and the BIBFRAME interchange structure
    (2022-11-08) Hahn, Jim
    Provides an overview of a possible interchange structure for BIBFRAME, using RDF/XML from the Library of Congress as that structure. The presentation details selected steps for normalizing an SVDE instance into the Library of Congress RDF/XML structure. It concludes with an example of loading normalized SVDE data into the Alma Sandbox at Penn by way of a locally hosted linked data editor, Marva.
  • Publication
    Bibliographic Entities are Described by Sets
    (2021-07-26) Hahn, Jim
    A set-theoretic frame based on Svenonius's theory of bibliographic entities is the departure point for this short talk on entity description. This talk will briefly show how properties of bibliographic entity descriptions may be identified using a frequent pattern data mining algorithm over targeted sets of existing metadata descriptions. The MARC21 corpus used in this case consisted of clustered sets of publishers and publisher locations from the library MARC21 records found in the Platform for Open Data (POD). POD is a data aggregation project involving member institutions of the IvyPlus Library Confederation and contains seventy million MARC21 records, forty million of which are unique.
  • Publication
    Share-VDE 2.0: a panel discussion among the Share-VDE working group chairs
    (2021-07-21) Hahn, Jim
    This panel will convene a diverse group of linked data professionals who serve as chairs of the Share-VDE working groups. The working groups include an Advisory Council (AC), Authority-Identifier Management Services (AIMS), Cluster Knowledge Base Editor (CKB), Sapientia Entity Identification (SEI), and a UX/UI group. The overall effect of combining their focus areas can be seen in the new Share-VDE 2.0 platform. Panelists will discuss how Share-VDE 2.0 implements an interoperable ecosystem of linked data structures and projects (e.g. LD4P/Sinopia), in part due to the information modeling completed by the SEI working group, which sought to reference and implement interoperable BIBFRAME entity models. To manage such models, a design of J.Cricket was completed with the CKB editor WG, incorporating services for the critical tasks of authority control with the AIMS WG. Panelists have contributed to the significant revision and enhancement of SVDE infrastructure, including support for a re-visioning of the front-end discovery interface that presents a next-generation linked data discovery system, Share-VDE.