Opendatabay APP

Synthetic PII Annotation Corpus

Synthetic Data Generation

Tags and Keywords

PII

Detection

Privacy

GPT

NLP

Free

About

This dataset is designed for detecting personally identifiable information (PII) and was created as an external resource for The Learning Agency Lab. The text was generated with GPT so that sensitive information appears in realistic written contexts. Its primary purpose is to support the development and testing of privacy-protection algorithms that detect and categorise different types of PII.

Columns

  • Essay: The text in which PII may be embedded. The column contains 2000 unique entries, all of which are valid.
  • PII: The PII detected in the corresponding 'Essay' entry, organised by type: student names, email addresses, usernames, identification numbers, phone numbers, and personal URLs. Like 'Essay', this column contains 2000 unique entries, all of which are valid.

Distribution

The dataset is provided as a CSV (comma-separated values) file of approximately 6.31 MB. It has two columns and 2000 rows, with each row containing one essay and its associated PII detections.
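As a quick sanity check after downloading, the shape described above can be verified with a few lines of Python. This is a minimal sketch, not an official loader: it assumes the CSV header uses the names 'Essay' and 'PII' exactly as listed under Columns, and the file name `synthetic_pii.csv` is hypothetical. The sample rows are invented stand-ins, not dataset content.

```python
import csv
import io


def summarize_csv(handle, expected_columns=("Essay", "PII")):
    """Count rows and distinct values per column in an open CSV file."""
    reader = csv.DictReader(handle)
    header = tuple(reader.fieldnames or ())
    missing = [name for name in expected_columns if name not in header]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")

    rows = 0
    distinct = {name: set() for name in expected_columns}
    for record in reader:
        rows += 1
        for name in expected_columns:
            distinct[name].add(record[name])

    return {
        "rows": rows,
        "columns": header,
        "unique": {name: len(values) for name, values in distinct.items()},
    }


# Invented two-row sample standing in for the real file.
sample_text = (
    "Essay,PII\n"
    '"Hello, I am Ann.","Ann"\n'
    '"Reach me at bob@example.com","bob@example.com"\n'
)
summary = summarize_csv(io.StringIO(sample_text))
print(summary)

# With the downloaded file (hypothetical name), the same call would be:
# with open("synthetic_pii.csv", newline="", encoding="utf-8") as f:
#     print(summarize_csv(f))
```

If the listing is accurate, running this on the real file should report 2000 rows and 2000 unique values in each column.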

Usage

This dataset is ideal for:
  • Developing and training machine learning models for automated PII detection.
  • Testing the effectiveness and accuracy of existing privacy protection algorithms.
  • Research and academic studies focused on natural language processing and data privacy.
  • Building applications that require sensitive information identification and anonymisation.
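
To illustrate the testing and anonymisation use cases above, here is a minimal rule-based baseline in Python. It is a sketch under stated assumptions, not a method from the dataset's authors: the regular expressions are simplified, cover only email addresses and URLs, and the function names and example text are invented for illustration. A trained model would be needed for types such as student names.

```python
import re

# Simplified patterns (assumptions for illustration, not exhaustive).
PATTERNS = (
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("URL", re.compile(r"https?://[^\s<>\"']+")),
)


def detect_pii(text):
    """Return (label, matched_text) pairs for every pattern hit."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found


def redact(text):
    """Replace each detected span with a placeholder such as [EMAIL]."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


def recall(predicted, expected):
    """Fraction of expected PII strings that the detector found."""
    expected = set(expected)
    if not expected:
        return 1.0
    return len(expected & set(predicted)) / len(expected)


essay = "Contact ann@example.com or visit https://example.com/ann today."
hits = detect_pii(essay)
print(hits)
print(redact(essay))

# Score against hypothetical labels; the name "Ann" is missed by the regexes.
predicted_strings = [value for _, value in hits]
print(recall(predicted_strings, ["ann@example.com", "Ann"]))
```

Comparing a baseline like this with the labels in the 'PII' column gives a floor against which learned detectors can be measured.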

Coverage

The dataset's content is synthetically generated with GPT and focuses on common PII types as they appear in varied textual contexts. No specific geographic, temporal, or demographic scope is defined, because the data is meant to represent diverse PII patterns rather than real individuals.

License

Attribution 4.0 International (CC BY 4.0)

Who Can Use It

This dataset is particularly useful for:
  • Data Scientists and Machine Learning Engineers: For building and refining PII detection models.
  • Researchers and Academics: For studies related to privacy-preserving AI and NLP.
  • Software Developers: For integrating PII detection capabilities into their applications.
  • Educational Institutions: For teaching and demonstrating concepts of data privacy and security.

Dataset Name Suggestions

  • Learning Agency Lab PII Detection Dataset
  • GPT-Generated PII Identification Data
  • Synthetic PII Annotation Corpus
  • Privacy Information Extraction Data

Attributes

Original Data Source: Synthetic PII Annotation Corpus

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 19/08/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
