Italian Dialectal Tweet Dataset

Data Science and Analytics

Tags and Keywords

Literature, NLP, Data, Cleaning, Text, Mining

About

This dataset, the Twitter Italian Negation (TIN) Corpus, documents non-standard negation usage in Italian as an instance of language change in Romance languages. It comprises 10,000 tweets collected between August and December 2019 from ten cities, including Milan, Rome, Naples, and New York City. The data provides tokenised text, frequency measures, and city metadata, so users can compare usage across geographic locations and examine how written Italian on social media reflects spoken forms.

Columns

The dataset is structured with the following key columns, designed to support in-depth linguistic analysis:

Tok (Tokenised text): A single token from a tweet. Tokens cover all words in the tweet as well as punctuation marks.
Abs (Absolute Frequency): The total number of times the token appears across all tweets in the dataset.
Rel (Relative Frequency): The token's frequency relative to the other tokens in the dataset.
Var (Variation): Marks how the token departs from standard usage, for instance a non-standard spelling or grammatical construction.
City: Identifies the city from which the tweet originated, allowing for analysis of usage differences across various locales.
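
How Abs and Rel relate to each other can be illustrated with a short sketch. It assumes that Rel is a token's absolute count divided by the total number of tokens, which the listing does not confirm, and the tokens below are invented for illustration rather than taken from the dataset.

```python
from collections import Counter

# Invented tokens for illustration only; not taken from the dataset.
tokens = ["non", "è", "mica", "vero", ".", "non", "lo", "so", "."]

# Abs: number of occurrences of each token.
abs_freq = Counter(tokens)

# Rel (assumed definition): Abs divided by the total number of tokens.
total = sum(abs_freq.values())
rel_freq = {tok: count / total for tok, count in abs_freq.items()}

# Print the three most frequent tokens with both measures.
for tok, count in abs_freq.most_common(3):
    print(f"{tok}\t{count}\t{rel_freq[tok]:.3f}")
```

Under this assumed definition the relative frequencies sum to 1, which gives a quick sanity check when comparing against the values shipped in the Rel column.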

Distribution

This dataset contains 10,000 tweets collected from August to December 2019. Each row represents a single token, so one tweet spans multiple rows. The listing does not specify the file format, but the data is typically provided as a CSV.
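
Since the listing offers the data as a ZIP download and suggests a CSV inside, reading it might look like the sketch below. The archive layout, the file name `tin_sample.csv`, the helper `read_token_rows`, and the assumption that the CSV header matches the column names above are all hypothetical; the demo archive is built in memory so the code runs without the real download.

```python
import csv
import io
import zipfile

def read_token_rows(zip_source):
    """Return rows (as dicts) from the first CSV file found in a ZIP archive."""
    with zipfile.ZipFile(zip_source) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in archive")
        with archive.open(csv_names[0]) as raw:
            # Decode bytes as UTF-8 (an assumption) and let csv handle newlines.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))

# Build a tiny in-memory archive with invented rows (hypothetical layout).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr(
        "tin_sample.csv",
        "Tok,Abs,Rel,Var,City\nnon,2,0.5,,Milan\nmica,1,0.25,,Rome\n",
    )
demo.seek(0)

rows = read_token_rows(demo)
print(len(rows), rows[0]["Tok"], rows[0]["City"])
```

For the real file, `read_token_rows` would be called with the path to the downloaded archive instead of the in-memory buffer; the encoding and header names should be checked against the actual file first.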

Usage

The dataset supports the following kinds of analysis:

Regional Variation Studies: Analyse and compare patterns of language usage across different cities or within a specific city.
Linguistic Evolution Tracking: Investigate language change over time by monitoring shifts in relative and absolute frequencies of negation constructions in tweets.
Dialectal Differences: Compare variations within tokens across different cities to understand how specific linguistic constructions are employed differently across regions or dialects.
Literary Analysis: Examine trends in language usage within literary contexts, such as poetry, by identifying commonly used words and phrases over time.
Socio-economic Impact: Explore how diverse socio-economic contexts or cultural trends (e.g., news, fashion, sports) have influenced the evolution of language use in tweets within each city.
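
A regional comparison of the kind listed above could start from per-city token shares. The following is a minimal sketch under stated assumptions: the rows are invented, the helper `token_share_by_city` is hypothetical, and the token "mica" is chosen only as an illustrative negation marker, since the listing does not say which constructions the corpus tracks.

```python
from collections import Counter

# Invented (token, city) rows; real rows would come from the Tok and City columns.
rows = [
    ("non", "Milan"), ("mica", "Milan"), ("so", "Milan"), ("mica", "Milan"),
    ("non", "Rome"), ("so", "Rome"), ("non", "Rome"), ("mica", "Rome"),
]

def token_share_by_city(rows, target):
    """Share of each city's tokens that equal `target`."""
    totals = Counter(city for _, city in rows)
    hits = Counter(city for tok, city in rows if tok == target)
    return {city: hits[city] / totals[city] for city in totals}

shares = token_share_by_city(rows, "mica")
for city, share in sorted(shares.items()):
    print(f"{city}: {share:.2f}")
```

Normalising by each city's own token total keeps cities with more tweets from dominating the comparison, which raw counts would not.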

Coverage

The dataset's scope includes:

Geographic Scope: Tweets collected from ten distinct cities: Milan, Rome, Naples, Palermo, Bologna, Turin, Florence, Cagliari, Genoa, and New York City. This allows for both intra-Italian regional analysis and a comparison with a major US city.
Time Range: Data was gathered between August and December 2019.
Demographic Scope: The dataset consists of tweets written in Italian, reflecting contemporary language usage from social media users in these specified urban areas.

License

CC0

Who Can Use It

This dataset is particularly valuable for:

Linguists and Philologists: To study language evolution, dialectal variations, and non-standard grammatical constructions in Romance languages.
Data Scientists and Analysts: To apply natural language processing (NLP) techniques for text analysis, frequency analysis, and pattern recognition in social media data.
Social Scientists and Urban Researchers: To explore the connection between urban environments, cultural trends, and linguistic expression.
Academics and Students: For research projects, dissertations, and learning about real-world language datasets.

Dataset Name Suggestions

Italian Tweet Negations Corpus
TIN Corpus: Italian Tweets
Urban Italian Language Shifts
Romance Language Negation in Tweets
Italian Dialectal Tweet Data

Attributes

Original Data Source: Italian Negation Constructions - Tweets

Listing Stats

Listed: 27/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
Price: Free
