Opendatabay APP

SCP Foundation Main Series Text Repository

Data Science and Analytics

Tags and Keywords

Text

NLP

Horror

Culture

SCP

Free

About

This dataset provides a large, high-quality text repository for natural language processing (NLP), with a focus on the horror and urban legend genres. It contains the main series articles from the SCP Foundation, numbered 001 through 6999. Each record captures the article text, rating, and tags, supporting detailed analysis of the structure and reception of speculative fiction.

Columns

The dataset contains eight attributes detailing each SCP article:
  • Code: The formal code name for the SCP (e.g., SCP-007), zero-padded to three digits for numbers below 100. This column has 6,999 unique values and is 100% valid.
  • Title: The text title of the SCP (e.g., "The Thing in the Room").
  • Text: The full body of the main web page, excluding image captions. Paragraphs are joined with newline characters (\n). The content may include non-narrative elements such as license notices. Duplicate values occur for articles whose text consists of the site's deletion notice.
  • Image Captions: All captions associated with images on the page, joined by newline characters. This attribute has 56% missing values.
  • Rating: The integer rating, positive or negative, given to the article by users on the site. Values range from -36 to 7,663, with a mean of 171. Approximately 6% of records are missing a rating.
  • State: The category the article falls under, which can include 'active', 'deleted', 'blocked', and 'age restricted'. 'active' is the most common state, representing 94% of the data.
  • Tags: Hidden tags (starting with an underscore) and visible tags applied to the article. Approximately 6% of the records are missing tag information.
  • Link: The URL linking directly to the original work. This column contains 6,999 unique URLs and is 100% valid.
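
As a hedged illustration of the Tags column, the sketch below separates hidden tags (those starting with an underscore) from visible ones. The function name split_tags is hypothetical, and the delimiter is an assumption: this listing does not say how tags are joined inside a cell, so whitespace is used by default and can be overridden.

```python
def split_tags(raw, sep=None):
    """Split a raw Tags cell into (hidden, visible) lists.

    sep=None splits on any whitespace. The real delimiter in the CSV is
    not documented in the listing, so treat this default as an assumption.
    """
    if not raw:
        # Missing tag information (roughly 6% of records) yields empty lists.
        return [], []
    tags = [t.strip() for t in raw.split(sep) if t.strip()]
    hidden = [t for t in tags if t.startswith("_")]
    visible = [t for t in tags if not t.startswith("_")]
    return hidden, visible


# Made-up cell value, for illustration only.
hidden, visible = split_tags("_cc keter humanoid")
print(hidden, visible)
```

If the file turns out to use commas, calling split_tags(cell, sep=",") should behave the same way.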

Distribution

The material is distributed as a single CSV file named scp6999.csv, with a size of 67.38 MB. The dataset contains 6,999 records. While the primary identifier and text fields are 100% valid, the image captions, rating, and tags attributes contain missing data.
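
A minimal sketch of reading the file and checking the share of missing values per column, using only the Python standard library. The helper names read_rows and missing_share are hypothetical, and the lowercase header names in the sample are an assumption: the exact header spelling in scp6999.csv is not given here, so adjust it to match the file.

```python
import csv
import io

# Article bodies can be long; raise the per-field limit above the csv
# module's default of 131072 characters so large Text cells do not fail.
csv.field_size_limit(2**31 - 1)


def read_rows(handle):
    """Read every record from an open CSV handle into a list of dicts."""
    return list(csv.DictReader(handle))


def missing_share(rows, column):
    """Return the fraction of rows whose value for `column` is empty."""
    if not rows:
        return 0.0
    empty = sum(1 for row in rows if not (row.get(column) or "").strip())
    return empty / len(rows)


# Tiny in-memory stand-in for scp6999.csv (values are made up for illustration).
sample = io.StringIO(
    "code,title,rating\n"
    "SCP-001,Example A,10\n"
    "SCP-002,Example B,\n"
)
rows = read_rows(sample)
print(len(rows), missing_share(rows, "rating"))
```

For the real file, passing a handle opened with open("scp6999.csv", newline="", encoding="utf-8") to read_rows should work, assuming the file is UTF-8 encoded.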

Usage

This resource is well suited to NLP challenges such as generating realistic SCP text. It can be used to predict the rating an SCP article will receive from its tags or textual content, or to predict which tags an article carries from its text. The dataset also supports using image captions and text to generate images that correspond to the article content.
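
The rating prediction task mentioned above could start from a very simple baseline. The sketch below is one assumed approach, not a method prescribed by the dataset: it averages the observed rating per tag and predicts a new article's rating as the mean of its known tag averages, falling back to the overall mean. The names fit_tag_means and predict_rating are hypothetical, and the example records are made up.

```python
from collections import defaultdict
from statistics import mean


def fit_tag_means(records):
    """Compute the mean rating per tag and the overall mean rating.

    records: iterable of (tags, rating) pairs, where tags is a list of
    strings and rating is a number. Rows with a missing rating should be
    filtered out before calling this.
    """
    by_tag = defaultdict(list)
    all_ratings = []
    for tags, rating in records:
        all_ratings.append(rating)
        for tag in tags:
            by_tag[tag].append(rating)
    global_mean = mean(all_ratings) if all_ratings else 0.0
    tag_means = {tag: mean(values) for tag, values in by_tag.items()}
    return tag_means, global_mean


def predict_rating(tags, tag_means, global_mean):
    """Average the per-tag means of known tags; fall back to the global mean."""
    known = [tag_means[t] for t in tags if t in tag_means]
    return mean(known) if known else global_mean


# Made-up training records, for illustration only.
train = [(["keter"], 100), (["safe"], 20), (["keter", "humanoid"], 200)]
tag_means, global_mean = fit_tag_means(train)
print(predict_rating(["keter", "safe"], tag_means, global_mean))
```

A baseline like this mainly gives a reference point; a model that uses the article text would be expected to do better, though that is not verified here.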

Coverage

The material covers the main series SCP articles from series 1 through 7, encompassing numbers 001 to 6999. The scope includes articles that are active, age-restricted, blocked, or deleted. The data is static and no future updates are expected.
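
To relate a Code value to its series, a small helper can be sketched as follows. It assumes that each series spans a block of one thousand numbers (001 to 999 for series 1, 1000 to 1999 for series 2, and so on up to 6000 to 6999 for series 7); the function name series_of is hypothetical.

```python
import re

# Matches codes such as "SCP-007" or "SCP-6999".
_CODE_PATTERN = re.compile(r"^SCP-(\d{3,4})$")


def series_of(code):
    """Return the series number (1 to 7) for a code like 'SCP-007'.

    Assumes thousand-number blocks per series, which is an assumption
    about the numbering rather than something the CSV states.
    """
    match = _CODE_PATTERN.match(code.strip())
    if not match:
        raise ValueError(f"unrecognised code: {code!r}")
    number = int(match.group(1))
    if not 1 <= number <= 6999:
        raise ValueError(f"number outside the covered range: {number}")
    return number // 1000 + 1


print(series_of("SCP-007"), series_of("SCP-6999"))
```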

License

CC BY-SA 3.0

Who Can Use It

The dataset is intended for researchers and developers working on NLP, particularly those studying text generation, classification, and analysis in the context of horror fiction and urban legends. It holds the maximum usability rating of 10.00.

Dataset Name Suggestions

  • SCP Foundation Main Series Text Repository
  • High-Quality Text Data for Horror NLP
  • SCP Articles 001 to 6999

Listing Stats

  • Views: 7
  • Downloads: 2
  • Listed: 17/12/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0