PubMed MeSH Article Classification Dataset
Free
About
This dataset comprises approximately 50,000 research articles sourced from the PubMed repository. Biomedical experts have manually annotated each article with 10-15 MeSH (Medical Subject Headings) labels. The dataset is primarily designed for extreme multi-label text classification within the biomedical domain. Because the full set of MeSH labels produces an extremely large output space with severe label sparsity, the original labels have been processed and mapped to their root MeSH categories, which gives a much smaller set of target labels.
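As a rough illustration of the root mapping described above: in the MeSH tree, the first letter of a tree number identifies its top-level category (a tree number beginning with C, for instance, falls under Diseases). The sketch below assumes tree numbers are available as plain strings. The helper names `root_of` and `root_indicators` are hypothetical, and this is not necessarily how the dataset's own meshroot or indicator columns were produced.

```python
# Root categories covered by this dataset's indicator columns.
ROOT_NAMES = {
    "A": "Anatomy",
    "B": "Organisms",
    "C": "Diseases",
    "D": "Chemicals and Drugs",
    "E": "Analytical, Diagnostic and Therapeutic Techniques, and Equipment",
    "F": "Psychiatry and Psychology",
    "G": "Phenomena and Processes",
    "H": "Disciplines and Occupations",
    "I": "Anthropology, Education, Sociology, and Social Phenomena",
    "J": "Technology, Industry, and Agriculture",
    "L": "Information Science",
    "M": "Named Groups",
    "N": "Health Care",
    "Z": "Geographicals",
}


def root_of(tree_number):
    """Return the root letter of a MeSH tree number such as 'C04.557'."""
    return tree_number.strip()[0].upper()


def root_indicators(tree_numbers):
    """Map a list of tree numbers to one 0/1 indicator per root category.

    Tree numbers whose root is not in ROOT_NAMES are ignored.
    """
    present = {root_of(t) for t in tree_numbers if t.strip()}
    return {letter: int(letter in present) for letter in ROOT_NAMES}


# Illustrative input only; these tree numbers are not taken from the dataset.
indicators = root_indicators(["C04.557", "A01", "C10"])
print([letter for letter, flag in indicators.items() if flag])
```

Collapsing every label to its root letter leaves at most 14 targets per article, which is what makes the task tractable compared with predicting individual MeSH headings.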
Columns
- Title: The title of each research article.
- abstractText: The abstract text extracted from each article.
- meshMajor: The Medical Subject Headings manually assigned to each article by experts.
- pmid: The unique PubMed Identifier for each individual article.
- meshid: The MeSH IDs corresponding to the assigned Medical Subject Headings.
- meshroot: The broader, mapped root category for each MeSH label, simplifying the classification task.
- A: A binary indicator representing the 'Anatomy' MeSH root category.
- B: A binary indicator representing the 'Organisms' MeSH root category.
- C: A binary indicator representing the 'Diseases' MeSH root category.
- D: A binary indicator representing the 'Chemicals and Drugs' MeSH root category.
- E: A binary indicator representing the 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment' MeSH root category.
- F: A binary indicator representing the 'Psychiatry and Psychology' MeSH root category.
- G: A binary indicator representing the 'Phenomena and Processes' MeSH root category.
- H: A binary indicator representing the 'Disciplines and Occupations' MeSH root category.
- I: A binary indicator representing the 'Anthropology, Education, Sociology, and Social Phenomena' MeSH root category.
- J: A binary indicator representing the 'Technology, Industry, and Agriculture' MeSH root category.
- L: A binary indicator representing the 'Information Science' MeSH root category.
- M: A binary indicator representing the 'Named Groups' MeSH root category.
- N: A binary indicator representing the 'Health Care' MeSH root category.
- Z: A binary indicator representing the 'Geographicals' MeSH root category.
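The indicator columns listed above can be turned into a multi-label target matrix with the standard library alone. This is a minimal sketch: it assumes the indicators are stored as 0/1 values, the two sample rows are invented for illustration, and the function names `label_vector` and `active_roots` are hypothetical. For the real file, the `io.StringIO` object would be replaced by an opened CSV file.

```python
import csv
import io

# The 14 root-category indicator columns, in the order listed above.
ROOT_COLUMNS = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "L", "M", "N", "Z"]


def label_vector(row):
    """Return the 14 indicators of one CSV row as a list of ints (assumes 0/1)."""
    return [int(row[col]) for col in ROOT_COLUMNS]


def active_roots(row):
    """Return the letters of the root categories set to 1 in this row."""
    return [col for col, flag in zip(ROOT_COLUMNS, label_vector(row)) if flag == 1]


# Invented sample with a subset of the columns; not real dataset content.
SAMPLE_CSV = (
    "Title,pmid,A,B,C,D,E,F,G,H,I,J,L,M,N,Z\n"
    "Hypothetical article one,1,0,1,1,0,1,0,0,0,0,0,0,1,0,0\n"
    "Hypothetical article two,2,1,0,0,1,0,0,1,0,0,0,0,0,0,0\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
label_matrix = [label_vector(row) for row in rows]

for row in rows:
    print(row["pmid"], active_roots(row))
```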
Distribution
The dataset is provided in CSV format. It contains approximately 50,000 records (rows) and has a file size of 119.8 MB.
Usage
This dataset is ideal for:
- Developing and evaluating models for extreme multi-label text classification in biomedical literature.
- Applying deep learning and transfer learning techniques to large text corpora.
- Text mining and natural language processing (NLP) research focusing on medical articles.
- Addressing challenges related to large output spaces and label sparsity in classification tasks.
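For the evaluation and label-sparsity use cases above, two quantities are commonly inspected: the micro-averaged F1 score across all labels, and the fraction of articles carrying each label. The sketch below is one possible way to compute both on 0/1 label matrices. The function names and the toy matrices are illustrative assumptions and are not part of the dataset.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over two equally shaped 0/1 label matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    # With no positive labels and no positive predictions, report 0.0.
    return 2 * tp / denom if denom else 0.0


def label_frequencies(y_true):
    """Fraction of rows in which each label column is set to 1."""
    n_rows = len(y_true)
    return [sum(column) / n_rows for column in zip(*y_true)]


# Toy matrices with three labels and two articles, for illustration only.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]

print(round(micro_f1(y_true, y_pred), 3))
print(label_frequencies(y_true))
```

Micro-averaging pools true and false positives over every label, so frequent categories dominate the score. Checking the per-label frequencies alongside it shows how much the rarer categories contribute.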
Coverage
The dataset covers research articles available within the PubMed repository. Beyond the biomedical nature of the articles, no specific geographic coverage, time range, or demographic information is provided.
License
CC0: Public Domain
Who Can Use It
- Machine learning engineers and data scientists specialising in natural language processing and text classification.
- Researchers and academics in fields such as biomedical informatics, computational linguistics, and artificial intelligence.
- Students and practitioners looking to train or test models on real-world medical and scientific textual data.
Attributes
Original Data Source Link: PubMed MeSH Article Classification Dataset