The paper describes how machine learning and automatic document classification techniques can be used for managing large numbers of news articles or Web page descriptions, lightening the load on domain experts. The paper uses two datasets, one with more than 800,000 Reuters news stories and another with over 41,000 Web sites, and classifies them into predefined categories using a Naïve Bayes algorithm. We discuss the different parameters and design decisions that normally appear when building automatic classifiers, including stemming, stop-words, thresholding, amount of data, and approaches for improving performance using the structure in XML documents. The methodology developed would enable Web-based applications or workflow systems to manage information more efficiently, i.e. by assigning documents to topics automatically or assisting humans in the process of doing so.
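To make the abstract's terms concrete, here is a minimal sketch of a Naïve Bayes text classifier with stop-word removal and a confidence threshold. It is not the authors' implementation: the class name, the tiny stop-word list, the add-one smoothing and the 0.6 threshold are all assumptions chosen for illustration, and stemming is left out to keep the example short.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "on"}

def tokenize(text):
    """Lowercase the text, keep alphabetic words, and drop stop-words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

class NaiveBayesSketch:
    """Multinomial Naive Bayes with add-one smoothing (hypothetical helper)."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per category
        self.word_counts = defaultdict(Counter)   # word frequencies per category
        self.vocab = set()

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocab.add(word)

    def posteriors(self, text):
        """Return the normalized posterior probability of each category."""
        total_docs = sum(self.doc_counts.values())
        words = [w for w in tokenize(text) if w in self.vocab]
        log_scores = {}
        for category, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[category].values())
            score = math.log(n_docs / total_docs)
            for word in words:
                count = self.word_counts[category][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            log_scores[category] = score
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}

    def classify(self, text, threshold=0.6):
        """Assign a category only if its posterior reaches the threshold."""
        probs = self.posteriors(text)
        best = max(probs, key=probs.get)
        return best if probs[best] >= threshold else None

clf = NaiveBayesSketch()
clf.train("Stocks fell on the market", "finance")
clf.train("Shares and bonds rallied in trading", "finance")
clf.train("The team won the match", "sport")
clf.train("Striker scored a goal in the final", "sport")

print(clf.classify("Market shares rallied"))
print(clf.classify("Weather forecast"))
```

Returning `None` below the threshold is one way to leave uncertain documents for a human to assign, which matches the "assisting humans" use the abstract mentions.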
Friday, September 17, 2004
Automatic Classification
Managing Content with Automatic Document Classification by Rafael A. Calvo, Jae-Moon Lee and Xiaobo Li appears in the latest Journal of Digital Information, vol. 5, no. 2.
Labels:
Classification
Thursday, September 16, 2004
Content Management System
Railroad is a standards-based repository for large binary files such as digital media, along with their metadata. It is designed to be easy to integrate with content management systems and other client software.
Many CMSes are more suitable for document-style content than they are for managing large files. Managing such large-file content in a CMS can result in scalability issues and deteriorating performance. Railroad instead is dedicated to the task of managing large files and their metadata.
Railroad uses the industry-standard Apache HTTP server. It uses Apache's mod_dav and mod_python, and metadata is stored in a PostgreSQL database. Information about metadata in a repository can be accessed and manipulated using WebDAV, and can also be extracted using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) in the standard Dublin Core format.
Glad to see it is using some standards in the metadata area.
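As a rough illustration of what harvesting Dublin Core over OAI-PMH involves, the sketch below builds a `ListRecords` request URL. The `verb` and `metadataPrefix=oai_dc` arguments come from the OAI-PMH protocol itself; the base URL and the function name are hypothetical, and nothing here is taken from Railroad's own documentation.

```python
from urllib.parse import urlencode

def build_oai_request(base_url, verb, **arguments):
    """Build an OAI-PMH request URL (hypothetical helper for illustration)."""
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"

# The base URL is a placeholder, not a real Railroad endpoint.
url = build_oai_request(
    "http://repository.example.org/oai",
    "ListRecords",
    metadataPrefix="oai_dc",  # oai_dc is the protocol's unqualified Dublin Core format
)
print(url)
```

A harvester would fetch this URL over HTTP and parse the XML response, following any resumption token the repository returns.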
MARC Code Lists
Additions to the MARC Code Lists for Relators, Sources, Description Conventions
The codes listed below have been recently approved for use in MARC 21 records. They include 3 new subject source codes, 1 new description convention source code and 1 redefinition of an authentication code. These new codes will be added to the online MARC Code Lists for Relators, Sources, Description Conventions. The new codes should not be used in exchange records until after November 15, 2004. This 60-day waiting period is required to provide MARC 21 implementers with time to include newly defined codes in any validation tables they may apply to the MARC fields where these codes are used.
MARC Term, Name, Title Sources
Additions:
- dacs - Describing Archives: A Content Standard (subfield $2 in Bibliographic and Community Information records in fields 600-651, 655-658 and field 040, subfield $f (Cataloging Source / Subject heading/thesaurus conventions) in Authority records)
- slvps - Standards of Learning for Virginia Public Schools (subfield $2 in Bibliographic and Community Information records in fields 600-651, 655-658 and field 040, subfield $f (Cataloging Source / Subject heading/thesaurus conventions) in Authority records)
- smda - Smithsonian National Air and Space Museum Directory of Airplanes (subfield $2 in Bibliographic and Community Information records in fields 600-651, 655-658 and field 040, subfield $f (Cataloging Source / Subject heading/thesaurus conventions) in Authority records)
Additions:
- rpk - Rossiiskiie pravila katalogizatsii (subfield $e in Authority and Bibliographic records in field 040)
Redefinition:
- nlc (Field 042, Authentication Code)
Former definition: Code nlc signifies that the CONSER descriptive elements and headings have been verified by the National Library of Canada. NLC authenticates records for Canadian imprints and records of Canadian interest.
New definition: Code nlc signifies that the CONSER descriptive elements and headings have been verified by Library and Archives Canada. LAC authenticates records for Canadian imprints and records of Canadian interest.
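The waiting period exists so that validation tables can be updated before the new codes appear in exchange records. A minimal sketch of such a table follows; the function name and the two established codes shown are illustrative assumptions, not a complete code list and not any particular system's implementation.

```python
from datetime import date

# Tiny illustrative subset of established subject source codes (assumption).
ESTABLISHED_CODES = {"lcsh", "mesh"}

# New subject source codes from this announcement and the end of the waiting period.
NEW_SUBJECT_SOURCE_CODES = {"dacs", "slvps", "smda"}
NEW_CODES_USABLE_AFTER = date(2004, 11, 15)

def is_valid_subject_source(code, exchange_date):
    """Check a subfield $2 source code against the table (hypothetical helper)."""
    if code in ESTABLISHED_CODES:
        return True
    if code in NEW_SUBJECT_SOURCE_CODES:
        # New codes should not be used until after November 15, 2004.
        return exchange_date > NEW_CODES_USABLE_AFTER
    return False

print(is_valid_subject_source("dacs", date(2004, 10, 1)))
print(is_valid_subject_source("dacs", date(2004, 11, 16)))
```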
Wednesday, September 15, 2004
Catalogs
The National Library of Medicine has their catalog available in the Entrez interface used for PubMed. It includes an XML display option, explosion of MeSH terms and more. Authority records for names and titles are not currently available in the NLM Catalog.
Tuesday, September 14, 2004
Guide to Institutional Repository Software
The Guide to Institutional Repository Software v 3.0 has been released by the Budapest Open Access Initiative.
Universities and research centers throughout the world are actively planning and implementing institutional repositories. This activity entails policy, legal, educational, cultural, and technical components, most of which are interrelated and each of which must be satisfactorily addressed for the repository to succeed.
The Open Society Institute intends this guide to help organizations with one facet of their repository planning: selecting the software system that best satisfies their institution’s needs. These needs will be driven by each institution’s content policies and by the various administrative and technical procedures required to implement those policies. Therefore, this guide is designed for institutions already familiar with the various administrative, policy, and related planning issues relevant to implementing an institutional repository. Organizations just starting their evaluation of the benefits and features offered by an institutional repository should first refer to the growing background literature as a context for using this guide.
Monday, September 13, 2004
FOAF
The papers from the FOAF-Galway meeting in early September are available. This is a metadata schema I find fascinating; it describes personal relationships. Human beings are great resources, and describing a social network could provide a unique method to access that wisdom and knowledge.
Some of the papers are:
- Bootstrapping the FOAF-Web: An Experiment in Social Network Mining / Peter Mika
- Descriptions of Social Relations / Peter Mika, Aldo Gangemi
- FOAF-Realm - control your friends' access to resources / Sebastian Ryszard Kruk
- Keyword Extraction from the Web for FOAF Metadata / Junichiro Mori, Yutaka Matsuo, Mitsuru Ishizuka, Boi Faltings
- Linking Semantically-Enabled Online Community Sites / Andreas Harth, John G. Breslin, Ina O'Murchu, Stefan Decker
- Using RDF + FOAF to create a local business review and search network / Chris Schmidt
- Moleskiing: a Trust-aware Decentralized Recommender System / Paolo Avesani, Paolo Massa, Roberto Tiella
Labels:
FOAF
Dublin Core
DC-Lib, the library application profile, has been revised and is available for review. Comments are welcomed. The profile has been reformatted in accordance with the CEN Application Profile Guidelines.
Labels:
Dublin Core