Opendatabay APP

PubMed MeSH Article Classification Dataset

Education & Learning Analytics

Tags and Keywords

PubMed, MeSH, Biomedical, Classification, Text


Free

About

This dataset comprises approximately 50,000 research articles from the PubMed repository. Biomedical experts manually annotated each article with 10-15 MeSH (Medical Subject Headings) labels. It is designed primarily for extreme multi-label text classification in the biomedical domain. Because the full MeSH vocabulary produces an extremely large output space with severe label sparsity, the original labels have also been mapped to their root MeSH categories, which gives a much smaller label set to classify against.
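The listing does not say how the original labels were mapped to root categories. One plausible approach, sketched below, relies on the fact that a MeSH tree number begins with the letter of its root category (for example, a tree number starting with "C" belongs to Diseases). The `TREE_NUMBERS` lookup and the helper names are hypothetical, and the tree numbers in it are illustrative values that should be checked against an official MeSH release before use.

```python
# Root category names as given in this listing's column descriptions.
ROOT_NAMES = {
    "A": "Anatomy",
    "B": "Organisms",
    "C": "Diseases",
    "D": "Chemicals and Drugs",
    "E": "Analytical, Diagnostic and Therapeutic Techniques, and Equipment",
    "F": "Psychiatry and Psychology",
    "G": "Phenomena and Processes",
    "H": "Disciplines and Occupations",
    "I": "Anthropology, Education, Sociology, and Social Phenomena",
    "J": "Technology, Industry, and Agriculture",
    "L": "Information Science",
    "M": "Named Groups",
    "N": "Health Care",
    "Z": "Geographicals",
}

# Hypothetical lookup from a MeSH heading to its tree numbers.
# Values are illustrative only; a real lookup would come from MeSH itself.
TREE_NUMBERS = {
    "Neoplasms": ["C04"],
    "Humans": ["B01.050.150.900.649.313.988.400.112.400.400"],
}


def roots_for_tree_numbers(tree_numbers):
    """Return the sorted set of root letters for a list of tree numbers."""
    return sorted({tn[0].upper() for tn in tree_numbers if tn})


def roots_for_headings(headings, lookup):
    """Collect root letters for every heading found in the lookup."""
    tree_numbers = []
    for heading in headings:
        tree_numbers.extend(lookup.get(heading, []))
    return roots_for_tree_numbers(tree_numbers)


if __name__ == "__main__":
    letters = roots_for_headings(["Neoplasms", "Humans"], TREE_NUMBERS)
    print([(letter, ROOT_NAMES[letter]) for letter in letters])
```

Headings missing from the lookup are silently skipped in this sketch; a stricter version might raise an error instead.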

Columns

  • Title: The title of each research article.
  • abstractText: The abstract text extracted from each article.
  • meshMajor: The Medical Subject Headings manually assigned to each article by experts.
  • pmid: The unique PubMed Identifier for each individual article.
  • meshid: The specific MeSH ID corresponding to the assigned Medical Subject Headings.
  • meshroot: The broader, mapped root category for each MeSH label, simplifying the classification task.
  • A: A binary indicator representing the 'Anatomy' MeSH root category.
  • B: A binary indicator representing the 'Organisms' MeSH root category.
  • C: A binary indicator representing the 'Diseases' MeSH root category.
  • D: A binary indicator representing the 'Chemicals and Drugs' MeSH root category.
  • E: A binary indicator representing the 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment' MeSH root category.
  • F: A binary indicator representing the 'Psychiatry and Psychology' MeSH root category.
  • G: A binary indicator representing the 'Phenomena and Processes' MeSH root category.
  • H: A binary indicator representing the 'Disciplines and Occupations' MeSH root category.
  • I: A binary indicator representing the 'Anthropology, Education, Sociology, and Social Phenomena' MeSH root category.
  • J: A binary indicator representing the 'Technology, Industry, and Agriculture' MeSH root category.
  • L: A binary indicator representing the 'Information Science' MeSH root category.
  • M: A binary indicator representing the 'Named Groups' MeSH root category.
  • N: A binary indicator representing the 'Health Care' MeSH root category.
  • Z: A binary indicator representing the 'Geographicals' MeSH root category.
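To make the column layout concrete, here is a minimal sketch of reading rows and turning the fourteen binary columns into a multi-label target. It uses only the Python standard library. The two sample rows are invented for illustration, the sample omits the meshMajor, meshid and meshroot columns for brevity, and it assumes the indicators are stored as 0/1 values, which the listing does not confirm.

```python
import csv
import io

# The fourteen root-category columns described in the list above.
ROOT_COLUMNS = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "L", "M", "N", "Z"]

# Invented sample data with the same column names as the listing.
SAMPLE_CSV = (
    "Title,abstractText,pmid,A,B,C,D,E,F,G,H,I,J,L,M,N,Z\n"
    "Example title one,Example abstract one.,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0\n"
    "Example title two,Example abstract two.,2,1,0,0,1,1,0,0,0,0,0,0,0,0,0\n"
)


def load_rows(handle):
    """Read every CSV row into a dict keyed by column name."""
    return list(csv.DictReader(handle))


def label_vector(row):
    """Return the 14-element 0/1 target vector for one row."""
    return [int(float(row[col])) for col in ROOT_COLUMNS]


def active_roots(row):
    """Return the letters of the root categories set to 1 for one row."""
    return [col for col, flag in zip(ROOT_COLUMNS, label_vector(row)) if flag == 1]


if __name__ == "__main__":
    for row in load_rows(io.StringIO(SAMPLE_CSV)):
        text = row["Title"] + " " + row["abstractText"]  # a simple model input
        print(row["pmid"], active_roots(row), text)
```

For the real file, the `io.StringIO(SAMPLE_CSV)` handle would be replaced by an open file handle on the downloaded CSV.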

Distribution

The dataset is provided in CSV format. It contains approximately 50,000 records (rows) and has a file size of 119.8 MB.
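A file of this size can be scanned row by row rather than loaded all at once. The sketch below streams a CSV and counts how often each root-category column is set, which is a quick check on label balance. The sample data is invented, and treating "1" or "1.0" as a positive flag is an assumption about how the indicators are stored.

```python
import csv
import io
from collections import Counter

ROOT_COLUMNS = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "L", "M", "N", "Z"]


def root_frequencies(handle):
    """Stream a CSV and count positive flags per root-category column.

    Returns (number_of_rows, Counter of column -> positive count).
    Columns absent from the file are simply never counted.
    """
    counts = Counter()
    n_rows = 0
    for row in csv.DictReader(handle):
        n_rows += 1
        for col in ROOT_COLUMNS:
            if (row.get(col) or "").strip() in {"1", "1.0"}:
                counts[col] += 1
    return n_rows, counts


if __name__ == "__main__":
    # Invented three-row sample with only a few of the columns.
    sample = io.StringIO("pmid,A,B,C\n1,0,1,1\n2,1,1,0\n3,0,1,0\n")
    n_rows, counts = root_frequencies(sample)
    for col in ROOT_COLUMNS:
        if counts[col]:
            print(col, counts[col], counts[col] / n_rows)
```

With the downloaded file, the handle would come from `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.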

Usage

This dataset is ideal for:
  • Developing and evaluating models for extreme multi-label text classification in biomedical literature.
  • Applications leveraging deep learning and transfer learning techniques on large text corpora.
  • Text mining and natural language processing (NLP) research focusing on medical articles.
  • Addressing challenges related to large output spaces and label sparsity in classification tasks.
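For the evaluation use case, multi-label predictions are commonly scored with micro-averaged F1, and label sparsity is often summarised by label cardinality (the average number of labels per article). The sketch below implements both in plain Python on invented 0/1 matrices. The function names are illustrative, and this is one reasonable choice of metric rather than one prescribed by the dataset.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (article, label) pairs.

    y_true and y_pred are equally shaped lists of 0/1 lists.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_cardinality(y):
    """Average number of positive labels per article."""
    return sum(sum(row) for row in y) / len(y) if y else 0.0


if __name__ == "__main__":
    # Invented targets and predictions for two articles and three labels.
    y_true = [[1, 0, 1], [0, 1, 0]]
    y_pred = [[1, 0, 0], [0, 1, 1]]
    print(micro_f1(y_true, y_pred))
    print(label_cardinality(y_true))
```

Micro-averaging pools every label decision before computing F1, so frequent categories dominate the score; macro-averaging per label is a common complement when rare categories matter.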

Coverage

The dataset's scope encompasses research articles available within the PubMed repository. Specific geographic coverage, time ranges, or detailed demographic information beyond the nature of the articles (biomedical) are not provided.

License

CC0: Public Domain

Who Can Use It

  • Machine learning engineers and data scientists specialising in natural language processing and text classification.
  • Researchers and academics in fields such as biomedical informatics, computational linguistics, and artificial intelligence.
  • Students and practitioners looking to train or test models on real-world medical and scientific textual data.

Dataset Name Suggestions

  • PubMed MeSH Article Classification Dataset
  • Biomedical Multi-Label Text Classification (MeSH)
  • Medical Research Article MeSH Dataset
  • PubMed Expert-Annotated MeSH Corpus

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 13/08/2025
  • Region: Global
  • UDQS quality score: 5 / 5
  • Version: 1.0
