Threat Intelligence Text Dataset

Tags and Keywords

  • Token
  • BERT
  • Cyber
  • Deep
  • NLP

Free

About

This curated dataset, Cyber-BERT, is designed for Natural Language Processing (NLP) applications in the cybersecurity domain. It contains text extracted from various cybersecurity sources, covering topics such as malware analysis, vulnerabilities, cyber threats, and network security. The dataset is well suited to training BERT-based models for tasks such as threat detection and text classification, and to broader cybersecurity research. The data has been preprocessed to remove URLs, non-text symbols, HTML tags, metadata, and redundant content.
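
The listing does not publish the preprocessing code, so the following is only a minimal sketch of what the cleaning steps named above (URL, HTML tag, and symbol removal, plus de-duplication) might look like. The regular expressions, the set of punctuation kept, and the function names `clean_text` and `deduplicate` are assumptions for illustration, not the dataset author's pipeline.

```python
import html
import re

# Patterns below are illustrative assumptions, not the dataset's actual rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Keep word characters, whitespace, and common punctuation; drop everything else.
SYMBOL_RE = re.compile(r"[^\w\s.,;:!?'\"()/-]")
SPACE_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip HTML entities and tags, URLs, and stray symbols from one text sample."""
    text = html.unescape(raw)          # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)       # remove HTML tags
    text = URL_RE.sub(" ", text)       # remove URLs
    text = SYMBOL_RE.sub(" ", text)    # remove non-text symbols
    return SPACE_RE.sub(" ", text).strip()


def deduplicate(texts):
    """Drop empty strings and case-insensitive duplicates, keeping first occurrences."""
    seen = set()
    unique = []
    for t in texts:
        key = t.lower()
        if t and key not in seen:
            seen.add(key)
            unique.append(t)
    return unique
```

Applying `clean_text` to each raw document and then passing the results through `deduplicate` would produce a list shaped like the dataset's single text column.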

Columns

  • text: This column contains the processed cybersecurity-related text.

Distribution

The dataset is typically provided as a CSV file. It contains approximately 50,000 samples, though the exact number may vary with collection updates. Preprocessing for NLP tasks included the removal of URLs, non-text symbols, HTML tags, metadata, and duplicate entries.
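
As a rough illustration of reading a CSV with a single text column, the sketch below uses only the Python standard library. The file name `cyber_bert.csv` in the comment is hypothetical, and the in-memory sample rows are made up for the demonstration; they are not taken from the dataset.

```python
import csv
import io


def load_texts(csv_file):
    """Return the non-empty values of the 'text' column from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return [
        (row.get("text") or "").strip()
        for row in reader
        if (row.get("text") or "").strip()
    ]


# With the real file (hypothetical name), usage would look like:
#   with open("cyber_bert.csv", newline="", encoding="utf-8") as f:
#       texts = load_texts(f)

# Demonstration on made-up rows held in memory.
sample = io.StringIO(
    "text\n"
    '"Ransomware group exploits VPN flaw, researchers say"\n'
    "Phishing kit targets banks\n"
)
texts = load_texts(sample)
```

Using `csv.DictReader` rather than splitting on commas keeps quoted samples that contain commas intact.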

Usage

This dataset offers a range of valuable applications, including:
  • Cyber Threat Detection: Utilise the dataset to train models for classifying security threats.
  • Named Entity Recognition (NER): Identify and extract key entities such as malware, exploits, and vulnerabilities from cybersecurity text.
  • Threat Intelligence Analysis: Extract valuable insights from cybersecurity reports and other relevant texts.
  • BERT Fine-Tuning: Build specialised NLP models tailored for security domains and specific cybersecurity challenges.
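
Before training a model for entity extraction, a rule-based baseline can be useful for a first look at the text. The sketch below pulls CVE identifiers (the standard `CVE-YYYY-NNNN` form with four or more trailing digits) out of text and counts how many samples mention each one. The function names and the example sentence are illustrative assumptions; this is not a substitute for a trained NER model, and it covers only one entity type.

```python
import re
from collections import Counter

# CVE identifiers: "CVE-", a four-digit year, then four or more digits.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cve_ids(text):
    """Return CVE identifiers found in the text, upper-cased, in order of appearance."""
    return [match.upper() for match in CVE_RE.findall(text)]


def count_cve_mentions(texts):
    """Count, for each CVE identifier, the number of samples that mention it."""
    counts = Counter()
    for t in texts:
        counts.update(set(extract_cve_ids(t)))  # one count per sample
    return counts


example = "Attackers chain CVE-2021-44228 with cve-2023-1234 in the wild"
ids = extract_cve_ids(example)
```

The per-sample counts from `count_cve_mentions` could also serve as weak labels when bootstrapping annotations for a fine-tuned model.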

Coverage

The text within this dataset is extracted from prominent cybersecurity sources including TheHackerNews, CVE Details, Any.Run, and OpenPhish. The dataset's scope is global. Specific time ranges for the data content itself are not provided.

License

CC0

Who Can Use It

This dataset is an excellent resource for:
  • Researchers focused on advancing NLP techniques in cybersecurity.
  • Data Scientists and Machine Learning Engineers developing threat detection systems or text classification models.
  • Security Analysts looking to automate aspects of threat intelligence analysis.
  • Anyone involved in building specialised NLP models for security domains.

Dataset Name Suggestions

  • Cyber-BERT
  • Cybersecurity NLP Corpus
  • Threat Intelligence Text Dataset
  • Security Text Analytics Data
  • BERT Security Dataset

Attributes

Original Data Source: Cyber-BERT

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 08/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0