MTA-KDD-19 Malware Traffic Analysis Knowledge Dataset
Data Science and Analytics
Free
About
This is an updated and refined dataset, specifically created to train and evaluate machine learning algorithms focused on malware traffic analysis. It was derived from the largest available online databases containing network traffic captures. The data product is characterized by a carefully selected set of widely-applicable features that have been cleaned and preprocessed. This preparation included handling missing data, removing noise, and keeping the size minimal. Although tailored for machine learning algorithms, the resulting dataset is designed to be applicable without bias toward any single application. The original methodology was designed to be automated to allow for ongoing updates, though the current dataset is not expected to receive future updates.
Columns
The data contains 34 columns, including features related to flow statistics, packet lengths, inter-arrival times (IAT), flag distributions, and protocol usage. All columns in this data product contain valid data with 0% mismatched or missing values across the 30.2k records.
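The column count and the 0% missing-value claim can be checked directly against the file. The following is a minimal sketch using only the Python standard library. It assumes the CSV has a header row and that every field is numeric; neither is confirmed by the listing. The function name `validate_csv` and the three-column sample are hypothetical and used only for illustration.

```python
import csv
import io
import math

EXPECTED_COLUMNS = 34  # 33 features plus the label, per the description above


def validate_csv(handle, expected_columns=EXPECTED_COLUMNS):
    """Return (record_count, problems) for a CSV that starts with a header row.

    A problem is recorded for a header of unexpected width, a row whose
    field count differs from the header, or a field that is empty,
    non-numeric, or NaN.
    """
    reader = csv.reader(handle)
    header = next(reader)
    problems = []
    if len(header) != expected_columns:
        problems.append(
            f"header has {len(header)} columns, expected {expected_columns}"
        )
    count = 0
    for row_no, row in enumerate(reader, start=2):
        count += 1
        if len(row) != len(header):
            problems.append(f"row {row_no}: {len(row)} fields")
            continue
        for name, value in zip(header, row):
            try:
                number = float(value)
            except ValueError:
                problems.append(f"row {row_no}: bad value {value!r} in {name}")
                continue
            if math.isnan(number):
                problems.append(f"row {row_no}: NaN in {name}")
    return count, problems


# Synthetic three-column sample with one empty field (not real dataset values).
sample = io.StringIO("a,b,label\n0.5,-0.1,0\n0.2,,0\n")
count, problems = validate_csv(sample, expected_columns=3)
```

For the real file, the same function could be called on `open("datasetLegitimate33featues.csv", newline="")`; an empty `problems` list would be consistent with the validity figures quoted above.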
Key fields include distributions of TCP flags:
- FinFlagDist (1): Mean 0.54, Std. Deviation 0.76.
- SynFlagDist (2): Mean 0.39, Std. Deviation 0.83.
- RstFlagDist (3): Mean -0.24, Std. Deviation 0.92.
- PshFlagDist (4): Mean 0.3, Std. Deviation 0.79.
- AckFlagDist (5): Mean 0.38, Std. Deviation 0.71.
Protocol and connectivity metrics:
- DNSoverIP (6): Mean -0.13, Std. Deviation 0.42.
- TCPoverIP (7): Mean 0.11, Std. Deviation 0.56.
- UDPoverIP (8): Mean -0.12, Std. Deviation 0.51.
- NumPorts (25): Mean 0.39, Std. Deviation 0.99.
- NumCon (29): Mean -0.18, Std. Deviation 0.28.
- NumIPdst (30): Mean -0.18, Std. Deviation 0.27.
- HTTPpkts (33): Mean 0.06, Std. Deviation 1.
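The per-column means and standard deviations listed above can be recomputed from the file. This sketch assumes the CSV header uses the same column names as the listing, and it uses the population standard deviation; the listing does not say whether population or sample deviation was reported, so small differences are possible. The helper name `column_stats` and the sample values are hypothetical.

```python
import csv
import io
import statistics


def column_stats(handle, columns):
    """Return {column: (mean, population_std)} for the named columns."""
    reader = csv.DictReader(handle)
    values = {name: [] for name in columns}
    for row in reader:
        for name in columns:
            values[name].append(float(row[name]))
    return {
        name: (statistics.fmean(vals), statistics.pstdev(vals))
        for name, vals in values.items()
    }


# Synthetic values for illustration only, not taken from the dataset.
sample = io.StringIO("FinFlagDist,SynFlagDist\n0.0,1.0\n1.0,1.0\n2.0,1.0\n")
stats = column_stats(sample, ["FinFlagDist", "SynFlagDist"])
```

Swapping `statistics.pstdev` for `statistics.stdev` gives the sample standard deviation instead, which may match the listed figures more closely if that convention was used.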
Length and flow statistics:
- MaxLen (9), MinLen (10), StdDevLen (11), AvgLen (12).
- MaxLenrx (19), MinLenrx (20), StdDevLenrx (21), AvgLenrx (22).
- FlowLEN (26), FlowLENrx (27).
- PktsIOratio (17), repeated_pkts_ratio (28).
- 1stPktLen (18).
Inter-arrival Time (IAT) metrics:
- MaxIAT (13), MinIAT (14), AvgIAT (15).
- MinIATrx (23), AvgIATrx (24).
- AvgWinFlow (16).
- DeltaTimeFlow (32).
- Start_flow (31).
The final column is the label (34). Its mean and standard deviation are both 0, which means every record in this file carries the same label value, 0.
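A standard deviation of 0 on the label means the file contains a single class, so it cannot train a binary classifier on its own. A quick way to confirm this before modelling is to count the distinct label values. The sketch below assumes the label column is named `label` in the header, which the listing does not confirm; `label_counts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def label_counts(handle, label_column="label"):
    """Count how often each label value occurs (values are kept as strings)."""
    reader = csv.DictReader(handle)
    return Counter(row[label_column] for row in reader)


# Synthetic sample for illustration only.
sample = io.StringIO("NumCon,label\n-0.2,0\n-0.1,0\n0.3,0\n")
counts = label_counts(sample)
is_single_class = len(counts) == 1
```

If `is_single_class` is true for the real file, supervised training needs records of the other class from another source, or a one-class approach.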
Distribution
The dataset has 34 columns and is likely distributed as a CSV file titled datasetLegitimate33featues.csv, approximately 19.84 MB in size. It contains 30.2 thousand valid records. Inspection of the provided data distributions indicates that every feature is 100% valid, with 0% mismatched or missing entries. The expected update frequency for this product is listed as 'Never'.
Usage
This dataset is ideally suited for developing and validating algorithms intended for the identification and categorization of malware-related network traffic. It is particularly useful for researchers and practitioners implementing machine learning models for network security monitoring and anomaly detection.
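As one illustration of the anomaly detection use case, the sketch below fits a per-feature baseline on reference records and flags a new record when any feature deviates by more than a chosen number of standard deviations. This is a simple z-score threshold, not the method used by the dataset's authors, and the function names, threshold of 3.0, and sample values are all assumptions made for the example.

```python
import statistics


def fit_baseline(rows):
    """Per-feature mean and population std from reference rows (lists of floats)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def max_abs_z(row, means, stds):
    """Largest absolute z-score in a row.

    Features with zero spread in the reference data are skipped,
    so deviations in those features are not detected.
    """
    scores = [abs(x - m) / s for x, m, s in zip(row, means, stds) if s > 0]
    return max(scores, default=0.0)


def is_anomalous(row, means, stds, threshold=3.0):
    """Flag a row whose largest absolute z-score exceeds the threshold."""
    return max_abs_z(row, means, stds) > threshold


# Synthetic reference rows for illustration only.
reference = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 1.0]]
means, stds = fit_baseline(reference)
typical = is_anomalous([1.5, 1.0], means, stds)
outlier = is_anomalous([10.0, 1.0], means, stds)
```

A baseline like this is mainly useful as a sanity check before moving to a trained model; it treats features independently and ignores correlations between them.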
Coverage
The source material does not specify the geographic location, time range, or demographic scope of the underlying network captures used to create the data.
License
CC BY-NC-SA 4.0
Who Can Use It
- Machine Learning Engineers: Utilizing the features to train and optimize novel malware detection models.
- Cybersecurity Researchers: Evaluating the performance and robustness of existing or new network intrusion detection systems (NIDS).
- Data Scientists: Performing statistical analysis on network flow characteristics to identify patterns specific to malicious activity.
- Government Agencies: Developing improved techniques for network traffic safeguarding (the dataset is tagged as 'Government').
Dataset Name Suggestions
- MTA-KDD-19 Malware Traffic Analysis Knowledge Dataset 2019
- Network Traffic Malware ML Feature Set
- Preprocessed Network Flow Security Data
Attributes
Original Data Source: MTA-KDD-19 Malware Traffic Analysis Knowledge Dataset
