Threat Intelligence Text Dataset


Tags and Keywords: Token, BERT, Cyber, Deep, NLP


Free

About

This curated dataset, Cyber-BERT, is designed for Natural Language Processing (NLP) applications in the cybersecurity domain. It contains text extracted from various cybersecurity sources, covering topics such as malware analysis, vulnerabilities, cyber threats, and network security. The dataset is suited to training BERT-based models for tasks such as threat detection and text classification, and to broader cybersecurity research. The data has been preprocessed to remove URLs, non-text symbols, HTML tags, metadata, and redundant content.

Columns

text: This column contains the processed cybersecurity-related text.
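
As a minimal sketch of reading this column, the snippet below uses Python's standard `csv` module. It assumes the file has a header row naming the column `text`, which the listing implies but does not confirm. The helper name `load_texts` is hypothetical, and the in-memory sample stands in for the real download.

```python
import csv
import io


def load_texts(fileobj):
    """Return the non-empty values of the `text` column from an open CSV file."""
    reader = csv.DictReader(fileobj)
    texts = []
    for row in reader:
        # Rows without a `text` value (or with only whitespace) are skipped.
        value = (row.get("text") or "").strip()
        if value:
            texts.append(value)
    return texts


# In-memory stand-in for the real file, so the sketch runs without a download.
sample = io.StringIO(
    'text\n"Ransomware targets hospitals, again"\n\nPhishing kit reused\n'
)
texts = load_texts(sample)
print(len(texts), texts[0])
```

For the downloaded file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved and UTF-8 is an assumed encoding.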

Distribution

The dataset is provided as a CSV file and contains approximately 50,000 samples, though the exact number may vary with collection updates. Preprocessing removed URLs, non-text symbols, HTML tags, metadata, and duplicate entries.
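
The listing does not publish the preprocessing code, so the following is only an illustrative sketch of what such cleaning might look like, for anyone who wants to apply comparable steps to their own text before combining it with this dataset. The regular expressions, the definition of "non-text symbols", and the function names `clean_text` and `deduplicate` are all assumptions, not the provider's method.

```python
import html
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: "non-text symbols" means anything outside letters, digits,
# whitespace, and basic punctuation.
SYMBOL_RE = re.compile(r"[^\w\s.,;:!?'\"()/-]")
SPACE_RE = re.compile(r"\s+")


def clean_text(raw):
    """Strip HTML entities and tags, URLs, and stray symbols, then collapse spaces."""
    text = html.unescape(raw)
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def deduplicate(texts):
    """Drop empty strings and case-insensitive exact duplicates, keeping first occurrences."""
    seen = set()
    unique = []
    for t in texts:
        key = t.casefold()
        if t and key not in seen:
            seen.add(key)
            unique.append(t)
    return unique


raw_samples = [
    "<p>New <b>Emotet</b> wave seen &amp; reported at https://example.com/a?b=1</p>",
    "New Emotet wave seen  reported at",
    "CVE-2021-44228 ★ affects Log4j (critical)",
]
cleaned = deduplicate([clean_text(s) for s in raw_samples])
print(cleaned)
```

Tags are stripped before URLs here so that a closing tag directly after a link is not mistaken for part of the URL.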

Usage

This dataset offers a range of valuable applications, including:

Cyber Threat Detection: Utilise the dataset to train models for classifying security threats.
Named Entity Recognition (NER): Identify and extract key entities such as malware, exploits, and vulnerabilities from cybersecurity text.
Threat Intelligence Analysis: Extract valuable insights from cybersecurity reports and other relevant texts.
BERT Fine-Tuning: Build specialised NLP models tailored for security domains and specific cybersecurity challenges.
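
Because the dataset exposes only a `text` column, entity extraction has to start from the raw text. The sketch below is a rule-based baseline for one entity type, CVE identifiers, which follow the form `CVE-<year>-<four or more digits>`. It is not a BERT-based NER model, and the function names and sample sentences are illustrative. Entities without a fixed syntax, such as malware family names, would need a trained model or a curated list instead.

```python
import re
from collections import Counter

# CVE identifiers follow the form CVE-<year>-<sequence of 4 or more digits>.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cves(text):
    """Return the CVE IDs in `text`, upper-cased, in order of first appearance."""
    found = []
    for match in CVE_RE.findall(text):
        cve = match.upper()
        if cve not in found:
            found.append(cve)
    return found


def count_cves(texts):
    """Count how many documents mention each CVE ID."""
    counts = Counter()
    for t in texts:
        counts.update(extract_cves(t))
    return counts


docs = [
    "Attackers chain CVE-2021-44228 with cve-2021-45046 in Log4j.",
    "Patch for CVE-2021-44228 released; CVE-2021-44228 still exploited.",
    "No identifiers in this phishing report.",
]
counts = count_cves(docs)
print(counts.most_common())
```

Matches from a baseline like this could also serve as weak labels when preparing data for the classification or fine-tuning tasks listed above.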

Coverage

The text within this dataset is extracted from prominent cybersecurity sources including TheHackerNews, CVE Details, Any.Run, and OpenPhish. The dataset's scope is global. Specific time ranges for the data content itself are not provided.

License

CC0

Who Can Use It

This dataset is an excellent resource for:

Researchers focused on advancing NLP techniques in cybersecurity.
Data Scientists and Machine Learning Engineers developing threat detection systems or text classification models.
Security Analysts looking to automate aspects of threat intelligence analysis.
Anyone involved in building specialised NLP models for security domains.

Dataset Name Suggestions

Cyber-BERT
Cybersecurity NLP Corpus
Threat Intelligence Text Dataset
Security Text Analytics Data
BERT Security Dataset

Attributes

Original Data Source: Cyber-BERT

Listing Stats

Listed: 08/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
