Historical Clay Tablet Text Samples

Data Science and Analytics

Tags and Keywords

Cuneiform

Sumerian

Babylonian

Akkadian

Dialect

About

This dataset supports the task of identifying the language or dialect of ancient clay tablet texts. It focuses on cuneiform, possibly the world's oldest writing system, which the Sumerians developed in Mesopotamia (present-day Iraq) in the late 4th millennium BC. Cuneiform remained in use for more than 3,000 years, first to record Sumerian and later adapted for several other languages of the region, most notably Akkadian with its Assyrian and Babylonian dialects. A Sumerian epic poem gives its own account of the origin of writing: a messenger's mouth was too "heavy" to repeat a message, so the message was set down on a clay tablet instead.

Columns

cuneiform: The cuneiform text snippet, encoded as Unicode characters.
lang: The language or dialect code assigned to the snippet.
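
As a minimal sketch of reading the two columns described above, the following uses Python's standard csv module. It assumes the file has a header row with the column names cuneiform and lang; the function name read_snippets and the sample signs are illustrative and not taken from the dataset.

```python
import csv
import io


def read_snippets(handle):
    """Yield (cuneiform, lang) pairs from an open CSV file with a header row."""
    reader = csv.DictReader(handle)
    for row in reader:
        yield row["cuneiform"], row["lang"]


# Tiny in-memory stand-in for train.csv. The signs are arbitrary code points
# from the Unicode Cuneiform block, written as escapes; they are not real data.
sample = io.StringIO(
    "cuneiform,lang\n"
    "\U0001202D\U00012097\U000121A4,SUX\n"
    "\U00012000\U000122BE,OLB\n"
)
rows = list(read_snippets(sample))

for text, label in rows:
    # len() counts code points, so each sign counts once
    # even though it lies outside the Basic Multilingual Plane.
    print(label, len(text))
```

For the real file, the handle would come from `open("train.csv", encoding="utf-8", newline="")`.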

Distribution

The data is provided in a single file, train.csv (4.63 MB), containing 134,000 snippets of cuneiform text. The dataset carries seven labels: Sumerian, which is a distinct language, and six dialects of Akkadian (Neo-Assyrian, Standard Babylonian, Late Babylonian, Neo-Babylonian, Middle Babylonian Peripheral, and Old Babylonian). The listing reports 100% valid records, with no missing or mismatched values. No updates to the dataset are expected.
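
The validity and label counts stated above can be checked after download. The sketch below is one way to do so, assuming rows arrive as (text, label) pairs and that the label codes are the seven listed under Coverage; the function name summarise and the sample rows are hypothetical.

```python
from collections import Counter

# The seven codes listed in the Coverage section of this listing.
EXPECTED_LABELS = {"SUX", "NEA", "STB", "LTB", "NEB", "MPB", "OLB"}


def summarise(rows):
    """Return (per-label counts, number of invalid rows) for (text, label) pairs.

    A row counts as invalid if its text is empty or its label is unknown.
    """
    counts = Counter()
    invalid = 0
    for text, label in rows:
        if not text.strip() or label not in EXPECTED_LABELS:
            invalid += 1
            continue
        counts[label] += 1
    return counts, invalid


# Illustrative rows only; real rows would be read from train.csv.
example_rows = [
    ("\U00012000\U000122BE", "OLB"),
    ("\U0001202D\U00012097", "SUX"),
    ("", "NEA"),  # missing text, counted as invalid
]
label_counts, invalid_rows = summarise(example_rows)
print(label_counts.most_common(), invalid_rows)
```

On the full file, seven distinct keys in the counter and zero invalid rows would match the figures given here.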

Usage

The dataset suits natural language processing and linguistic studies. It was originally used for a multiclass classification task: identifying the language or dialect of each snippet. It also supports binary classification, such as deciding whether a text is Sumerian or not (language detection without dialect identification). Beyond classification, the snippets can be analysed for structure, for example which logograms tend to cluster together in a given language, or how logograms come into or fall out of use across later dialects.
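
Two of the tasks above can be sketched briefly: collapsing the labels for the binary Sumerian-or-not task, and counting which signs appear next to each other in each language. The sketch assumes that every non-whitespace code point is one sign, which may not hold for every snippet. The label "AKK" and both function names are hypothetical choices, not part of the dataset.

```python
from collections import Counter, defaultdict


def to_binary(label):
    """Collapse the seven codes into Sumerian versus everything else."""
    # "AKK" is a made-up umbrella label for the six Akkadian dialects.
    return "SUX" if label == "SUX" else "AKK"


def sign_bigrams(rows):
    """Count adjacent sign pairs per label for (text, label) pairs."""
    table = defaultdict(Counter)
    for text, label in rows:
        # Assumption: one code point is one sign; whitespace is a separator.
        signs = [ch for ch in text if not ch.isspace()]
        table[label].update(zip(signs, signs[1:]))
    return table


# Illustrative rows only; real rows would be read from train.csv.
demo = [
    ("\U0001202D\U00012097\U0001202D", "SUX"),
    ("\U00012000\U000122BE", "OLB"),
]
pairs = sign_bigrams(demo)
print(to_binary("NEA"), pairs["SUX"].most_common(1))
```

Comparing the most common pairs between labels gives a first view of which signs cluster together in one language but not another.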

Coverage

The texts originate in Mesopotamia (modern Iraq). They cover languages written over a span of more than 3,000 years, beginning in the late 4th millennium BC. The dataset uses seven language codes: Sumerian (SUX), Neo-Assyrian (NEA), Standard Babylonian (STB), Late Babylonian (LTB), Neo-Babylonian (NEB), Middle Babylonian Peripheral (MPB), and Old Babylonian (OLB).
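
For convenience when reporting results, the seven codes can be kept in a small lookup table. The sketch below takes the codes and names from this listing; the helper name describe and the family grouping as a return value are illustrative.

```python
# Codes and names as given in this listing.
LANGUAGE_NAMES = {
    "SUX": "Sumerian",
    "NEA": "Neo-Assyrian",
    "STB": "Standard Babylonian",
    "LTB": "Late Babylonian",
    "NEB": "Neo-Babylonian",
    "MPB": "Middle Babylonian Peripheral",
    "OLB": "Old Babylonian",
}


def describe(code):
    """Return (full name, language family) for a label code."""
    name = LANGUAGE_NAMES[code]  # raises KeyError for an unknown code
    # Sumerian stands alone; the other six are dialects of Akkadian.
    family = "Sumerian" if code == "SUX" else "Akkadian"
    return name, family


print(describe("MPB"))
```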

License

CC BY-SA 3.0

Who Can Use It

Data Scientists/Machine Learning Engineers: For training and evaluating language identification models, especially in multiclass and binary classification settings.
Linguists and Historians: For conducting quantitative analyses on dialect shifts, vocabulary use, and structural changes in ancient written languages.
NLP Researchers: For extending current methods to ancient and historical scripts represented in Unicode.
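
As a starting point for the model training mentioned above, the sketch below is a small naive Bayes baseline over individual signs with add-one smoothing. It is not the method used in the original classification task, only a simple reference. It assumes each non-whitespace code point is one sign, and the function names and toy training strings are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(rows):
    """Collect per-label sign counts from (text, label) pairs."""
    sign_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in rows:
        signs = [ch for ch in text if not ch.isspace()]
        sign_counts[label].update(signs)
        label_counts[label] += 1
        vocab.update(signs)
    return sign_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    sign_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_signs = sum(sign_counts[label].values())
        # Add-one smoothing, with one extra slot for signs never seen in training.
        denom = total_signs + len(vocab) + 1
        score = math.log(n_docs / total_docs)
        for ch in text:
            if ch.isspace():
                continue
            score += math.log((sign_counts[label][ch] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data with Latin letters standing in for cuneiform signs.
model = train([("AAB", "SUX"), ("AAA", "SUX"), ("CCD", "OLB"), ("CDD", "OLB")])
print(predict(model, "AA"), predict(model, "DC"))
```

A held-out split of train.csv would be needed to measure how well such a baseline separates the seven labels.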

Dataset Name Suggestions

Cuneiform Language Identification Snippets
Ancient Mesopotamian Dialect Corpus
Akkadian and Sumerian Language Recogniser Data
Historical Clay Tablet Text Samples

Attributes

Original Data Source: Historical Clay Tablet Text Samples

Listing Stats

Listed: 02/11/2025
Region: Global
Quality: 5 / 5
Version: 1.0
