Papers: Metadata Quality Solutions

Starts at: Wed, Oct 23, 2024, 14:00 EDT
Finishes at: Wed, Oct 23, 2024, 15:30 EDT
Venue: DSI Seminar room
Moderator: Inkyung Choi

Moderator

Inkyung Choi

OCLC
Inkyung Choi received her MLIS from Syracuse University and her PhD from the University of Wisconsin-Milwaukee. She served as a teaching assistant professor at the School of Information Sciences at the University of Illinois Urbana-Champaign, where she taught metadata, linked data, information modeling, and information organization. She currently serves on the DCMI education committee and actively engages with members of the metadata community worldwide. Her research interests are knowledge organization, metadata, ontology, and linked data.

Presentations

An LLM Based Method for Domain Specific Mapping of Metadata Terms to a Thesaurus

Authors: Mahiro Irie, Mitsuharu Nagamori

Highly interoperable metadata terms may take on roles that differ from what their labels suggest, depending on the context in which they are used. However, because vocabularies and the relationships between terms are often insufficiently defined, metadata schema designers with limited metadata expertise find it difficult to identify interoperable terms among the domain-specific vocabularies in use. This study proposes a method that uses Large Language Models (LLMs) to map interoperable metadata terms to thesaurus concepts that represent the roles of those terms in specific contexts. Combining interoperable metadata terms with the domains in which they are used makes it possible to associate them with words that represent their roles. With this approach, users without extensive metadata knowledge are expected to discover more interoperable terms through thesaurus-based search.

Mahiro Irie

Master’s Programs in Informatics, University of Tsukuba
Mahiro Irie is a master's student in Information Science and a member of the Metadata Laboratory at the University of Tsukuba, Japan. He earned a Bachelor of Science in Media Sciences and Engineering from the University of Tsukuba in 2024. His research interests are metadata schemas, metadata vocabularies, and metadata interoperability.

Synthetic Signal Identification in LLM Datasets

Authors: Jim Hahn

This research addresses the quality of training data in LLMs using methods from signaling theory and the talk page metadata of Wikipedia articles. The method's significance lies in lowering the cost of information quality assessment in datasets. Natural language processing of the metadata text produced sentiment, reading complexity, and self-reference scores, which contribute to the computationally derived signals. Results showed that it is possible to understand indicators of information quality using textual computation over the metadata in article pages.
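
As a rough illustration of the kind of text-derived signals the abstract mentions, the sketch below computes a self-reference score and a simple reading-complexity proxy for a talk-page comment. The abstract does not say how these scores were actually calculated, so the pronoun list, the function names, and the use of mean sentence length as a complexity proxy are all assumptions made for this example.

```python
import re

# Assumed pronoun list: the abstract does not specify how self-reference is scored.
SELF_REFERENCE_WORDS = {
    "i", "me", "my", "mine", "myself",
    "we", "us", "our", "ours", "ourselves",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def self_reference_score(text):
    """Share of tokens that are first-person pronouns (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in SELF_REFERENCE_WORDS)
    return hits / len(tokens)


def mean_sentence_length(text):
    """Average number of tokens per sentence, a crude reading-complexity proxy."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


# Invented example comment, not taken from any real talk page.
comment = "I think we should cite a source here. The claim is unverified."
print(self_reference_score(comment), mean_sentence_length(comment))
```

Scores like these are cheap to compute over large numbers of pages, which is consistent with the abstract's stated goal of lowering the cost of quality assessment.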

Jim Hahn

Penn Libraries
I am the Head of Metadata Research at the University of Pennsylvania Libraries, leading linked data and metadata projects and research for the Libraries. Working collaboratively across the Libraries, I am developing a vision for the services, technologies, and policies that enhance discovery of collections, following international standards and best practices for linked data and metadata. I hold an M.S. and C.A.S. in Library and Information Science from the University of Illinois, and I am a current PhD student in Information Sciences at the University of Illinois.

Leveraging Linked Data Fragments for enhanced data publication: the Share-VDE case study

Authors: Andrea Gazzarini

In big data-driven environments, accessing, querying, and processing vast datasets efficiently is challenging.

Linked Data Fragments (LDF) have emerged as a promising paradigm for addressing these challenges. They provide a distributed and scalable approach for publishing and serving Linked Data.

As part of the Share-VDE initiative, we developed a set of Web APIs that adopt Linked Data Fragments and provide the following benefits: real-time RDF generation and publication, on-demand ontology mapping, and multi-provenance management.

We aim to showcase our implementation efforts in enabling real-time, multi-provenance, and multi-mapping RDF publication by introducing an RDF API layer built upon the innovative concept of Linked Data Fragments.
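
For readers unfamiliar with Linked Data Fragments, the sketch below shows the core idea behind one common fragment type, the Triple Pattern Fragment: a client asks for all triples matching a subject/predicate/object pattern (any position may be left open), and the server returns one page of matches together with a total count. This is a minimal in-memory illustration only. The `Triple` type, the `select_fragment` function, and the sample data are hypothetical, and the abstract does not describe how the Share-VDE APIs are implemented.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str


def select_fragment(triples, subject=None, predicate=None, obj=None,
                    page=1, page_size=100):
    """Return one page of triples matching the pattern, plus the total count.

    A value of None in any position acts as a wildcard.
    """
    def matches(triple):
        return ((subject is None or triple.subject == subject)
                and (predicate is None or triple.predicate == predicate)
                and (obj is None or triple.object == obj))

    matching = [t for t in triples if matches(t)]
    start = (page - 1) * page_size
    return {
        "total": len(matching),  # count metadata a client can use for query planning
        "page": page,
        "triples": matching[start:start + page_size],
    }


# Invented sample data using made-up "ex:" identifiers.
DATA = [
    Triple("ex:work1", "ex:title", "Title A"),
    Triple("ex:work1", "ex:creator", "ex:agent1"),
    Triple("ex:work2", "ex:title", "Title B"),
    Triple("ex:work2", "ex:creator", "ex:agent1"),
]

fragment = select_fragment(DATA, predicate="ex:creator", obj="ex:agent1", page_size=1)
print(fragment["total"], fragment["triples"])
```

Restricting the server to simple patterns like this keeps each request cheap, while more complex queries are assembled on the client side from several fragment requests.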

Andrea Gazzarini

SpazioCodice SRL
Andrea has over 20 years of experience in various software engineering areas, from telecommunications to banking. He has worked for several medium- and large-scale companies, such as IBM and Orga Systems.
Andrea has several certifications in the Java programming language (programmer, developer, web component developer, business component developer, and JEE architect), BEA products (build and portal solutions), and Apache Solr (Lucid Apache Solr/Lucene Certified Developer).

In 2009, Andrea entered the wonderful world of open-source projects and became a committer for the Apache Qpid project.
His adventure with the search domain began in 2010 when he met Apache Solr and, later, Elasticsearch… and it was love at first sight. Since then, he has been a search engineer in many projects in different fields (bibliographic, e-government, e-commerce, geospatial).
In 2015 he wrote “Apache Solr Essentials”, published by Packt Publishing.
In 2018 he founded his company, SpazioCodice.
Since 2020 he has been the tech lead of the Share-VDE initiative, a library-driven initiative that brings together the bibliographic catalogs and authority files of a community of libraries in a shared discovery environment based on Linked Data.

Metadata Enrichment with Named Entity Recognition using GPT-4

Authors: Ashwin Nair, Ee Min Hoon, Robin Dresel

To enhance the user experience and resource discoverability of Infopedia, the Singapore encyclopedia, the National Library Board of Singapore (NLB) uses Generative Pre-trained Transformer 4 (GPT-4) for Named Entity Recognition (NER), aiming to automate metadata enrichment of its digital encyclopedia articles. This initiative leverages GPT-4's capabilities in accurately identifying and incorporating relevant Singaporean entities before integrating them into the NLB's Knowledge Graph, improving recommendations of related resources. An evaluation on a subset of 100 articles demonstrates a precision score of 0.975, indicating high entity detection with minimal inaccuracies. The team acknowledges challenges related to GPT-4’s black-box nature and the potential for non-reproducibility. This effort illustrates the potential of generative AI to streamline metadata enrichment processes, offering a promising avenue for enhancing metadata of digital libraries.
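
For context on the reported metric, precision is the share of detected entities that are correct: true positives divided by all predicted entities. The sketch below computes it for a single article using set comparison. The entity names are invented toy data and do not come from the NLB evaluation, and the `precision` helper is a hypothetical function written for this example.

```python
def precision(predicted, gold):
    """Fraction of predicted entities that also appear in the gold annotations."""
    predicted, gold = set(predicted), set(gold)
    if not predicted:
        return 0.0  # convention chosen here for the no-predictions case
    true_positives = len(predicted & gold)
    return true_positives / len(predicted)


# Toy data for illustration only, not results from the study.
predicted_entities = {"Raffles Hotel", "Marina Bay", "Orchard Road", "Changi Airport"}
gold_entities = {"Raffles Hotel", "Marina Bay", "Orchard Road", "Sentosa"}

print(precision(predicted_entities, gold_entities))
```

Precision says nothing about entities the model failed to detect; measuring those would require recall, which the abstract does not report.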

Ashwin Nair

National Library Board
Ashwin Nair is a Manager/Librarian (Systems) at the National Library Board in Singapore, with a strong background in AI, machine learning, and data science. He holds a Bachelor's degree in Information & Communication Technology and has experience applying technologies such as GPT-4, graph neural networks, and SPARQL to library and information management projects. Ashwin has also worked as a Research Assistant in scientometrics at Nanyang Technological University, where he developed predictive models for enhancing interdisciplinary collaborations. His skills span programming, data analysis, and machine learning, with a particular focus on using natural language processing and network science to solve real-world optimization problems.