Italian Dialectal Tweet Dataset
Data Science and Analytics
Free
About
This dataset, the Twitter Italian Negation (TIN) Corpus, documents non-standard negation in Italian tweets and offers a view of language change within Romance languages. It comprises 10,000 tweets collected between August and December 2019 from ten cities, including Milan, Rome, Naples, and New York City. The data provides tokenised text, frequency measures, and city metadata, so users can explore regional linguistic variation and compare usage across geographic locations. It is also useful for studying how features of spoken Italian surface in written form on social media.
Columns
The dataset is structured with the following key columns, designed to support in-depth linguistic analysis:
- Tok (Tokenised text): The tweet text split into individual tokens. Every word and punctuation mark in a tweet counts as a token.
- Abs (Absolute Frequency): The total number of times a particular token appears across all tweets in the dataset.
- Rel (Relative Frequency): How often a token appears relative to the other tokens in the dataset.
- Var (Variation): How a token departs from standard usage, for instance a non-standard spelling or grammatical construction.
- City: Identifies the city from which the tweet originated, allowing for analysis of usage differences across various locales.
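The relationship between the Abs and Rel columns can be sketched in a few lines of Python. This is a minimal illustration, not the corpus authors' method: it assumes relative frequency means a token's count divided by the total number of tokens, which the listing does not confirm. The function name `frequency_table` and the sample tokens are hypothetical.

```python
from collections import Counter


def frequency_table(tokens):
    """Return {token: (absolute_count, relative_frequency)} for a token list.

    Assumption: relative frequency = count / total number of tokens.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: (n, n / total) for tok, n in counts.items()}


# Made-up tokens for illustration only; not taken from the dataset.
sample_tokens = ["non", "lo", "so", "mica", ",", "non", "è", "vero"]

table = frequency_table(sample_tokens)
for tok, (abs_freq, rel_freq) in sorted(table.items()):
    print(f"{tok}\t{abs_freq}\t{rel_freq:.3f}")
```

Under this assumption the relative frequencies of all tokens sum to 1, which gives a quick sanity check against the Rel values in the real file.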
Distribution
This dataset contains 10,000 tweets collected from August to December 2019. Each row represents a single token, so one tweet spans multiple rows. The file format is not documented, though the data is typically provided as a CSV file.
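A one-token-per-row CSV of this kind can be read with Python's standard `csv` module. The sketch below is hedged: the header names (Tok, Abs, Rel, Var, City) are taken from the column list, but whether the real file uses exactly these headers and value formats is an assumption, and the sample rows are made up.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration; the real file's header names and value
# formats are assumptions based on the documented column list.
SAMPLE_CSV = """Tok,Abs,Rel,Var,City
non,2,0.25,,Milan
mica,1,0.125,,Milan
nun,1,0.125,non,Naples
"""


def load_rows(handle):
    """Parse token rows, converting the frequency columns to numbers."""
    rows = []
    for row in csv.DictReader(handle):
        row["Abs"] = int(row["Abs"])
        row["Rel"] = float(row["Rel"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))

# Each row is one token, so counting rows per city counts tokens per city.
tokens_per_city = Counter(row["City"] for row in rows)
print(tokens_per_city)
```

To read an actual download, replace the `io.StringIO` object with a file handle such as `open(path, newline="", encoding="utf-8")`, where `path` points to wherever the file was saved.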
Usage
This dataset is ideal for various analytical applications and research initiatives:
- Regional Variation Studies: Analyse and compare patterns of language usage across different cities or within a specific city.
- Linguistic Evolution Tracking: Investigate language change by monitoring shifts in the relative and absolute frequencies of negation constructions across the August to December 2019 collection window.
- Dialectal Differences: Compare variations within tokens across different cities to understand how specific linguistic constructions are employed differently across regions or dialects.
- Literary Analysis: Examine trends in language usage within literary contexts, such as poetry, by identifying commonly used words and phrases over time.
- Socio-economic Impact: Explore how diverse socio-economic contexts or cultural trends (e.g., news, fashion, sports) have influenced the evolution of language use in tweets within each city.
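A regional comparison of the kind listed above could start from a per-city share of a target token. The following is a minimal sketch, assuming each row is one token occurrence tagged with its city. The function `per_city_share` and the sample rows are hypothetical and do not come from the corpus.

```python
from collections import Counter


def per_city_share(rows, target):
    """Share of each city's token rows that equal `target` (case-insensitive)."""
    target = target.lower()
    totals = Counter()
    hits = Counter()
    for token, city in rows:
        totals[city] += 1
        if token.lower() == target:
            hits[city] += 1
    return {city: hits[city] / totals[city] for city in totals}


# Made-up (token, city) pairs for illustration only.
sample_rows = [
    ("Non", "Milan"), ("so", "Milan"), ("mica", "Milan"), ("vero", "Milan"),
    ("nun", "Naples"), ("saccio", "Naples"),
]

shares = per_city_share(sample_rows, "mica")
for city, share in sorted(shares.items()):
    print(f"{city}: {share:.2f}")
```

Normalising by each city's own token count keeps cities with more tweets from dominating the comparison. A real analysis would also need to decide how to treat spelling variants recorded in the Var column.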
Coverage
The dataset's scope includes:
- Geographic Scope: Tweets collected from ten distinct cities: Milan, Rome, Naples, Palermo, Bologna, Turin, Florence, Cagliari, Genoa, and New York City. This allows for both intra-Italian regional analysis and a comparison with a major US city.
- Time Range: Data was gathered between August and December 2019.
- Demographic Scope: The dataset consists of tweets written in Italian, reflecting contemporary language usage from social media users in these specified urban areas.
License
CC0
Who Can Use It
This dataset is particularly valuable for:
- Linguists and Philologists: To study language evolution, dialectal variations, and non-standard grammatical constructions in Romance languages.
- Data Scientists and Analysts: To apply natural language processing (NLP) techniques for text analysis, frequency analysis, and pattern recognition in social media data.
- Social Scientists and Urban Researchers: To explore the connection between urban environments, cultural trends, and linguistic expression.
- Academics and Students: For research projects, dissertations, and learning about real-world language datasets.
Dataset Name Suggestions
- Italian Tweet Negations Corpus
- TIN Corpus: Italian Tweets
- Urban Italian Language Shifts
- Romance Language Negation in Tweets
- Italian Dialectal Tweet Data
Attributes
Original Data Source: Italian Negation Constructions - Tweets