SCP Foundation Main Series Text Repository
Data Science and Analytics
Free
About
This dataset provides a large, high-quality text repository for natural language processing (NLP), with a focus on the horror and urban legend genres. It contains the main series articles from the SCP Foundation, numbered 001 through 6999. Each record captures the article text, rating, and tags, supporting detailed analysis of the structure and reception of speculative fiction.
Columns
The dataset contains eight attributes detailing each SCP article:
- Code: The formal code name for the SCP (e.g., SCP-007), with the number zero-padded to three digits when it is shorter. This column has 6,999 unique values and is 100% valid.
- Title: The text title of the SCP (e.g., "The Thing in the Room").
- Text: The full body of the main web page, excluding image captions. Paragraphs are joined using newline characters (\n). This content may include elements other than the narrative, such as license notices, and duplicate values occur where the site's deletion notice was captured as the text.
- Image Captions: All captions associated with images on the page, joined by newline characters. This attribute has 56% missing values.
- Rating: A positive or negative integer rating given to the article by users on the site. Values range from -36 to 7663, with a mean of 171. Approximately 6% of records are missing a rating.
- State: The category the article falls under, which can include 'active', 'deleted', 'blocked', and 'age restricted'. 'active' is the most common state, representing 94% of the data.
- Tags: Hidden tags (starting with an underscore) and visible tags applied to the article. Approximately 6% of the records are missing tag information.
- Link: The URL linking directly to the original work. This column contains 6,999 unique URLs and is 100% valid.
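The missing-value figures quoted above can be checked with a few lines of standard-library Python. The following is a minimal sketch, not the publisher's own tooling: it assumes the CSV header uses the attribute names exactly as listed here and that missing values are stored as empty strings, neither of which is confirmed by this page. The two sample rows are invented for illustration.

```python
import csv
import io

# Hypothetical two-row sample. The real header names and the encoding of
# missing values (assumed here: empty strings) may differ in scp6999.csv.
SAMPLE = """Code,Title,Text,Image Captions,Rating,State,Tags,Link
SCP-007,Example title A,Some narrative text.,,120,active,tag-a tag-b,https://example.org/scp-007
SCP-008,Example title B,More narrative text.,A caption,,deleted,,https://example.org/scp-008
"""


def load_rows(handle):
    """Read every record from an open CSV handle into a list of dicts."""
    return list(csv.DictReader(handle))


def missing_share(rows, column):
    """Return the fraction of rows whose value in `column` is empty or absent."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if not (row.get(column) or "").strip())
    return missing / len(rows)


rows = load_rows(io.StringIO(SAMPLE))
for column in ("Image Captions", "Rating", "Tags"):
    print(column, missing_share(rows, column))
```

To run this against the real file, pass an open file handle to load_rows instead of the in-memory sample, for example open("scp6999.csv", newline="", encoding="utf-8"), where the encoding is an assumption. Because the Text field can be long, csv.field_size_limit may need to be raised first.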
Distribution
The material is distributed as a single CSV file named scp6999.csv, with a size of 67.38 MB. The dataset contains 6,999 records. While the primary identifier and text fields are 100% valid, attributes such as image captions, rating, and tags contain missing data.
Usage
This resource is ideal for NLP challenges, such as generating realistic SCP text. It can be used to predict the rating an SCP article will receive based on its tags or textual content, or to predict what tags an SCP is given by analyzing the text. The dataset also supports using image captions and text to generate images that correspond to the article content.
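As a starting point for the rating-prediction task, a simple baseline can be built from tags alone before any text model is tried. The sketch below predicts a rating as the average of the mean ratings of an article's known tags, and falls back to the overall mean when no tag is known. It assumes tags arrive as a whitespace-separated string, which this page does not confirm. The function names, tag values, and ratings are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def split_tags(raw_tags):
    """Split a raw tag string. Assumption: tags are whitespace separated."""
    return raw_tags.split()


def fit_tag_means(records):
    """Compute the mean rating per tag and the overall mean rating.

    `records` is an iterable of (raw_tags, rating) pairs; rows with a
    missing rating should be filtered out before calling this.
    """
    ratings_by_tag = defaultdict(list)
    all_ratings = []
    for raw_tags, rating in records:
        all_ratings.append(rating)
        for tag in split_tags(raw_tags):
            ratings_by_tag[tag].append(rating)
    tag_means = {tag: mean(values) for tag, values in ratings_by_tag.items()}
    return tag_means, mean(all_ratings)


def predict_rating(raw_tags, tag_means, global_mean):
    """Average the per-tag means of known tags; fall back to the global mean."""
    known = [tag_means[tag] for tag in split_tags(raw_tags) if tag in tag_means]
    return mean(known) if known else global_mean


# Illustrative training pairs, not taken from the dataset.
train = [("safe humanoid", 100), ("keter humanoid", 300), ("safe", 50)]
tag_means, global_mean = fit_tag_means(train)
print(predict_rating("safe keter", tag_means, global_mean))
print(predict_rating("unseen-tag", tag_means, global_mean))
```

A baseline like this gives a reference error level that any model using the article text should be expected to beat.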
Coverage
The material covers the main series SCP articles from series 1 through 7, encompassing numbers 001 to 6999. The scope includes articles that are active, age-restricted, blocked, or deleted. The data is static and no future updates are expected.
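The relationship between article numbers, zero-padded codes, and series can be sketched in a few lines. This example assumes each series spans a block of one thousand numbers (series 1 covering 001 to 999, series 7 covering 6000 to 6999), which is consistent with the range stated above but is not spelled out on this page. The helper names are hypothetical.

```python
def format_code(number):
    """Build a code such as SCP-007, zero-padding the number to three digits."""
    return f"SCP-{number:03d}"


def series_of(code):
    """Return the series for a code, assuming blocks of 1000 numbers per series."""
    number = int(code.split("-", 1)[1])
    return number // 1000 + 1


for n in (7, 999, 1000, 6999):
    code = format_code(n)
    print(code, series_of(code))
```

Grouping records by series_of makes it straightforward to compare ratings or tag usage across series.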
License
CC BY-SA 3.0
Who Can Use It
The dataset is intended for researchers and developers focused on NLP, particularly those studying text generation, classification, and analysis within the context of horror fiction and urban legends. It holds a maximum usability rating of 10.00.
Dataset Name Suggestions
- SCP Foundation Main Series Text Repository
- High-Quality Text Data for Horror NLP
- SCP Articles 001 to 6999
Attributes
Original Data Source: SCP Foundation Main Series Text Repository
