Opendatabay APP

Global Patent Categorisation Dataset

Government & Civic Records

Tags and Keywords

Law, Beginner, Text, NLP, Patent, Classification, Codes, USPTO, Meaning

Free

About

This dataset maps each Cooperative Patent Classification (CPC) code to its meaning. CPC is a hierarchical system used to categorise patents, and these definitions add context for analyses or competitions involving U.S. patent phrases.

Columns

  • code: This is a hierarchical code used for categorising patents. It corresponds to the "context" column often found in US Patent competition data.
  • title: This column contains the meaning or definition of the associated CPC code.
  • section: This represents the "section" symbol within the CPC system, typically ranging from A to H, and including Y.
  • class: A two-digit "class" identifier within the hierarchical code.
  • subclass: A single-letter subclass identifier.
  • group: A one to three-digit group identifier.
  • main_group: The part of the code that follows the forward slash, two or more digits long, identifying the main group or subgroup.
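The column layout above can be illustrated by splitting a code string into its parts. This is a minimal sketch, not the method used to build the dataset: it assumes codes are written without spaces (for example "A01B1/02"), that shorter codes such as "A" or "H04" are also valid entries, and that digit counts can be matched loosely. The function name parse_cpc is hypothetical.

```python
import re
from typing import Optional

# Assumed layout: section letter, two-digit class, subclass letter,
# group digits, then an optional "/digits" part (the main_group column).
# Every level after the section is optional so that "A" or "H04" also parse.
CPC_PATTERN = re.compile(
    r"^(?P<section>[A-HY])"
    r"(?:(?P<cls>\d{2})"
    r"(?:(?P<subclass>[A-Z])"
    r"(?:(?P<group>\d+)"
    r"(?:/(?P<main_group>\d+))?"
    r")?)?)?$"
)


def parse_cpc(code: str) -> dict[str, Optional[str]]:
    """Split a CPC code into the columns described in the listing."""
    match = CPC_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a recognisable CPC code: {code!r}")
    parts = match.groupdict()
    # "class" is a Python keyword, so the regex group is named "cls".
    parts["class"] = parts.pop("cls")
    return parts


print(parse_cpc("A01B1/02"))
print(parse_cpc("H04"))
```

Levels that are absent from a shorter code come back as None, which keeps the hierarchy explicit when grouping codes by section or class.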

Distribution

The dataset is typically provided as a CSV file. Its core identifiers are highly varied: there are 260,476 unique code values and 223,674 unique title values.
Regarding the distribution of patent sections:
  • Section 'B' accounts for approximately 22% of the codes.
  • Section 'H' accounts for about 15% of the codes.
  • Other sections collectively make up roughly 63% of the codes.
  • In another reported distribution, Section 'B' is at 20%, 'C' at 10%, and 'Other' sections at 70%.
The listing also reports three sets of binned numerical ranges with label counts, without naming the columns they describe:
  • One set spans 1.00 - 2.96 (56,473 labels) up to 97.04 - 99.00 (24 labels).
  • Another spans 1.00 - 60.96 (150,147 labels) up to 2939.04 - 2999.00 (590 labels).
  • A third spans 0.00 - 19215.90 (236,188 labels) up to 941579.10 - 960795.00 (10 labels), so values in that column reach roughly 961,000.
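Figures such as the unique-value counts and the per-section shares can be recomputed from the CSV. The sketch below uses only the standard library and a four-row in-memory sample. The rows and titles are invented for illustration and are not taken from the dataset, and the function name summarise is hypothetical.

```python
import csv
import io
from collections import Counter

# Invented sample rows; only the column names follow the listing.
SAMPLE_CSV = """code,title,section
B01,Illustrative title one,B
B02,Illustrative title two,B
H04,Illustrative title three,H
A01,Illustrative title one,A
"""


def summarise(csv_text):
    """Count unique codes and titles, and the share of rows per section."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    sections = Counter(row["section"] for row in rows)
    total = sum(sections.values())
    return {
        "unique_codes": len({row["code"] for row in rows}),
        "unique_titles": len({row["title"] for row in rows}),
        "section_share": {s: n / total for s, n in sections.items()},
    }


summary = summarise(SAMPLE_CSV)
print(summary)
```

For the real file, the same function could be given the file's contents read as text. Shares are returned as fractions rather than percentages.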

Usage

This dataset is ideal for several applications, including:
  • Enhancing data for U.S. Patent Phrase to Phrase Matching competitions.
  • Providing contextual information for Natural Language Processing (NLP) tasks related to patent documents.
  • Supporting text classification models that rely on patent codes.
  • Research and analysis of patent landscapes and technological areas.
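As a sketch of the first two uses, the code-to-title mapping can be attached to phrase pairs through their "context" value. The "anchor" and "target" field names, the sample rows, and the titles are assumptions made for illustration. Only the link between "code" and "context" comes from the column descriptions.

```python
# Hypothetical code -> title lookup, as it might be built from the CSV.
titles = {
    "A47": "Hypothetical title for A47",
    "H04": "Hypothetical title for H04",
}

# Hypothetical phrase pairs; "context" holds a CPC code.
pairs = [
    {"anchor": "example phrase", "target": "another phrase", "context": "A47"},
    {"anchor": "example phrase", "target": "third phrase", "context": "X00"},
]


def add_context_titles(pairs, titles):
    """Return copies of the pairs with the CPC title added as 'context_title'.

    Codes missing from the lookup get None instead of raising an error.
    """
    return [
        {**pair, "context_title": titles.get(pair["context"])}
        for pair in pairs
    ]


enriched = add_context_titles(pairs, titles)
for row in enriched:
    print(row["context"], "->", row["context_title"])
```

The input rows are left unmodified, so the enriched copies can be built and discarded freely while experimenting with different text features.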

Coverage

  • Geographic Scope: The data has a global scope.
  • Time Range: The dataset was listed on 08 June 2025. Specific historical time coverage for the patents themselves is not detailed.
  • Quality: The dataset is listed with a quality rating of 5 out of 5.

License

CC0

Who Can Use It

  • Competitors: Individuals participating in patent-related matching challenges, such as the U.S. Patent Phrase to Phrase Matching competition.
  • Data Scientists & NLP Practitioners: Those working on text analysis, classification, and understanding of legal or technical documents.
  • Researchers: Academics or industry professionals studying patent trends, intellectual property, or innovation landscapes.
  • Developers: Building applications that require an understanding or categorisation of patent information.

Dataset Name Suggestions

  • Cooperative Patent Classification Code Meanings
  • CPC Code Definitions for Patents
  • Global Patent Categorisation Data
  • USPTO Patent Classification Glossary
  • Hierarchical Patent Codes

Attributes

Listing Stats

  • Views: 2
  • Downloads: 0
  • Listed: 08/06/2025
  • Region: Global
  • UDQS Quality: 5 / 5
  • Version: 1.0