Threat Intelligence Text Dataset
Free
About
This curated dataset, Cyber-BERT, is designed for Natural Language Processing (NLP) applications in the cybersecurity domain. It contains text extracted from cybersecurity sources, covering topics such as malware analysis, vulnerabilities, cyber threats, and network security. The dataset is suited to training BERT-based models for tasks such as threat detection and text classification, and to broader cybersecurity research. The text has been preprocessed to remove URLs, non-text symbols, HTML tags, metadata, and redundant content.
Columns
- text: This column contains the processed cybersecurity-related text.
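Reading the single `text` column needs nothing beyond the standard library. The following is a minimal sketch: the helper name `read_texts` and the in-memory sample rows are illustrative assumptions, not part of the dataset.

```python
import csv
import io
from typing import Iterable, List


def read_texts(lines: Iterable[str], column: str = "text") -> List[str]:
    """Return the non-empty values of `column` from CSV input."""
    reader = csv.DictReader(lines)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"expected a '{column}' column, got {reader.fieldnames}")
    # Skip rows whose value is missing or only whitespace.
    return [row[column].strip() for row in reader
            if row[column] and row[column].strip()]


# Tiny in-memory stand-in for the real file (the rows are made up).
sample = io.StringIO(
    "text\n"
    '"Ransomware spreads via phishing, researchers say"\n'
    '"Patch released for buffer overflow"\n'
)
texts = read_texts(sample)
print(len(texts), "rows loaded")
```

For the real file, the same function can be given an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.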
Distribution
The dataset is typically provided as a CSV file. It contains approximately 50,000 samples, though the exact number may vary as the collection is updated. Preprocessing removed URLs, non-text symbols, HTML tags, metadata, and duplicate entries.
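The listing does not publish the preprocessing code, so the following is only a sketch of what such a cleaning pass could look like when preparing additional text in the same style. The regular expressions, the set of retained punctuation, and the function names `clean_text` and `clean_corpus` are all assumptions.

```python
import html
import re
from typing import Iterable, List

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Keep word characters, whitespace, and basic punctuation; drop everything else.
SYMBOL_RE = re.compile(r"""[^\w\s.,;:!?'"()/-]""")
SPACE_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip HTML tags, URLs, and stray symbols, then collapse whitespace."""
    text = html.unescape(raw)        # turn entities such as &amp; into characters
    text = TAG_RE.sub(" ", text)     # remove HTML tags
    text = URL_RE.sub(" ", text)     # remove URLs
    text = SYMBOL_RE.sub(" ", text)  # remove non-text symbols
    return SPACE_RE.sub(" ", text).strip()


def clean_corpus(raws: Iterable[str]) -> List[str]:
    """Clean each entry and drop empty or duplicate results (case-insensitive)."""
    seen = set()
    cleaned_rows = []
    for raw in raws:
        cleaned = clean_text(raw)
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            cleaned_rows.append(cleaned)
    return cleaned_rows
```

Tags are removed before URLs so that a closing tag directly after a link is not swallowed by the URL pattern.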
Usage
Applications of this dataset include:
- Cyber Threat Detection: Utilise the dataset to train models for classifying security threats.
- Named Entity Recognition (NER): Identify and extract key entities such as malware, exploits, and vulnerabilities from cybersecurity text.
- Threat Intelligence Analysis: Extract valuable insights from cybersecurity reports and other relevant texts.
- BERT Fine-Tuning: Build specialised NLP models tailored for security domains and specific cybersecurity challenges.
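As a rough illustration of the entity-extraction use case, the sketch below pulls CVE identifiers out of text with a regular expression. This is a rule-based baseline, not a trained NER model: the listing describes only a `text` column, so labels for a BERT-based tagger would presumably have to come from elsewhere. The function name `extract_cve_ids` is an assumption.

```python
import re
from typing import List

# CVE identifiers have the form CVE-<4-digit year>-<4 or more digits>.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cve_ids(text: str) -> List[str]:
    """Return unique CVE identifiers in order of first appearance, upper-cased."""
    found: List[str] = []
    for match in CVE_RE.finditer(text):
        cve = match.group(0).upper()
        if cve not in found:
            found.append(cve)
    return found


print(extract_cve_ids("Log4Shell (CVE-2021-44228) was followed by cve-2021-45046."))
```

Matches found this way could also serve as weak labels when bootstrapping annotations for a learned model.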
Coverage
The text in this dataset is extracted from cybersecurity sources including TheHackerNews, CVE Details, Any.Run, and OpenPhish. Coverage is global. No time range is specified for the content.
License
CC0
Who Can Use It
This dataset is intended for:
- Researchers focused on advancing NLP techniques in cybersecurity.
- Data Scientists and Machine Learning Engineers developing threat detection systems or text classification models.
- Security Analysts looking to automate aspects of threat intelligence analysis.
- Anyone involved in building specialised NLP models for security domains.
Dataset Name Suggestions
- Cyber-BERT
- Cybersecurity NLP Corpus
- Threat Intelligence Text Dataset
- Security Text Analytics Data
- BERT Security Dataset
Attributes
Original Data Source: Cyber-BERT