Andalusian Hotel Opinions Dataset
About
This dataset is a resource for Natural Language Processing (NLP) research on the Spanish language. It addresses a recognised scarcity of publicly available Spanish corpora for NLP tasks, even though Spanish is spoken by over 493 million native speakers globally [1]. The collection consists of hotel reviews, which makes it suitable for sentiment analysis applications, and it aims to extend NLP technology to a wider Spanish-speaking audience [1].
Columns
- title: The title provided for the hotel review [1, 2].
- rating: The user's rating for the hotel, based on a 5-star scale [1, 2].
- review_text: The full textual content of the review itself [1, 2].
- location: Information referencing the city and region where the hotel is located. Note that reviews originating from the COAH corpus may have 'NaN' values in this column as the original source did not provide this information [1, 2].
- hotel: The name of the hotel being reviewed. Similar to 'location', reviews from the COAH corpus may have 'NaN' values in this column due to a lack of original data [1, 2].
- label: A sentiment label for classification. Neutral reviews (those with a 3-star rating) are tagged with '3', so the column is only binary once those rows are removed; they must be dropped before any binary classification task [1, 2].
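As a minimal sketch of the neutral-review filter described for the label column, the snippet below reads a few rows with Python's standard `csv` module and drops those tagged '3'. The column names come from the list above; the sample rows, the empty fields standing in for 'NaN', and the use of 0 and 1 for negative and positive are assumptions made for illustration, not confirmed details of the published files.

```python
import csv
import io

# Hypothetical sample rows. Column names follow the dataset description;
# the values, including 0/1 for negative/positive, are assumptions.
SAMPLE_CSV = """title,rating,review_text,location,hotel,label
Excelente,5,Muy buen hotel,Seville_Province_of_Seville_Andalucia,Hotel A,1
Regular,3,Ni bien ni mal,,,3
Horrible,1,No volveremos,Seville_Province_of_Seville_Andalucia,Hotel B,0
"""

NEUTRAL_LABEL = "3"  # neutral tag stated in the column description


def drop_neutral(rows):
    """Return only the rows whose label is not the neutral tag."""
    return [row for row in rows if row["label"].strip() != NEUTRAL_LABEL]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
binary_rows = drop_neutral(rows)
print(len(rows), len(binary_rows))
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle; the actual file name is not given here, so it is left out.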
Distribution
The dataset contains 18,172 hotel reviews in Spanish [1]. Most of them, 16,356 reviews, were collected from TripAdvisor in December 2021; the remaining 1,816 come from the COAH corpus, compiled in 2014 [1].
The main dataset is highly class-imbalanced, largely because Andalusian hotels generally receive positive feedback [1]. A smaller, balanced version with 7,615 reviews is also included for tasks that are sensitive to class imbalance [1, 2]. Data files are typically provided in CSV format [3].
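To make the class-imbalance point concrete, the sketch below counts labels and then downsamples every class to the size of the smallest one. This is only one common way to balance a dataset; it is not claimed to be how the provided balanced version was built, and the toy data and the helper name `downsample_balanced` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def downsample_balanced(items, key, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


# Toy (review_id, label) pairs: 8 positive and 2 negative, purely illustrative.
toy = [(f"r{i}", "pos") for i in range(8)] + [(f"r{i}", "neg") for i in range(8, 10)]

before = Counter(label for _, label in toy)
balanced = downsample_balanced(toy, key=lambda pair: pair[1])
after = Counter(label for _, label in balanced)
print(before, after)
```

Downsampling discards majority-class reviews, so when the full dataset is used instead, class weighting or an imbalance-aware metric may be a better fit.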
Usage
This dataset is well-suited for a range of sentiment analysis tasks, including:
- Binary sentiment classification: By excluding neutral (3-star) reviews, users can perform binary classification to distinguish between positive and negative sentiments [1].
- Multi-class sentiment classification: The dataset supports classification across multiple sentiment categories, potentially including neutral sentiments [1].
- Prediction of review ratings: It can be used to develop models that predict the star rating a user might assign based on the review text [1].
- Topic modelling on reviews: Analysts can explore prevalent themes and topics discussed within hotel reviews [1].
- Other general sentiment analysis applications [1].
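For the multi-class and rating-prediction tasks listed above, a star rating can be mapped to a coarse sentiment class. In the sketch below, only the treatment of 3 stars as neutral comes from the dataset description; treating 1 to 2 stars as negative and 4 to 5 stars as positive is an assumption, and `rating_to_sentiment` is a hypothetical helper name.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating (int, float, or numeric string) to a sentiment class."""
    stars = int(float(rating))
    if not 1 <= stars <= 5:
        raise ValueError(f"rating out of range: {rating!r}")
    if stars == 3:
        return "neutral"  # 3-star reviews are described as neutral
    # Assumption: ratings below 3 are negative, ratings above 3 are positive.
    return "positive" if stars > 3 else "negative"


print([rating_to_sentiment(r) for r in (1, 2, 3, 4, 5)])
```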
Coverage
- Geographic Scope: The reviews pertain specifically to hotels located in Andalusia, Spain [1]. Location details often specify the city and province, such as 'Seville_Province_of_Seville_Andalucia' [2, 4].
- Time Range: The majority of the reviews were retrieved from TripAdvisor in December 2021 [1]. An additional segment of the data originates from the COAH corpus, which was compiled in 2014 [1].
- Language: All reviews are in Spanish [1].
- Data Availability Notes: For reviews sourced from the COAH corpus, hotel names and location details are not available and are marked as 'NaN' [1]. Reviews are predominantly positive; a balanced subset is provided to mitigate the resulting class imbalance [1].
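The location format and the 'NaN' handling described above can be sketched as a small parser. It assumes every location follows the single example given ('Seville_Province_of_Seville_Andalucia'), with `_Province_of_` separating the city from the province and region; other formats may exist in the data, and `parse_location` is a hypothetical helper.

```python
def parse_location(value):
    """Split a location string into (city, province, region), or return None.

    Assumes the 'City_Province_of_Province_Region' pattern from the one
    example in the description; anything else yields None.
    """
    if value is None or value.strip() in ("", "NaN", "nan"):
        return None  # COAH reviews carry no location information
    marker = "_Province_of_"
    if marker not in value:
        return None
    city, rest = value.split(marker, 1)
    # The region is taken to be the last underscore-separated token.
    province, _, region = rest.rpartition("_")
    if not province:
        return None
    return city.replace("_", " "), province.replace("_", " "), region


print(parse_location("Seville_Province_of_Seville_Andalucia"))
print(parse_location("NaN"))
```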
License
CC-BY-NC
Who Can Use It
This dataset is ideal for:
- Natural Language Processing (NLP) researchers: Those working on advancing NLP techniques for languages other than English, specifically Spanish [1].
- Data scientists and machine learning engineers: Individuals developing and testing models for sentiment analysis, text classification, and rating prediction [1].
- Academics: Scholars and students interested in linguistics, computational linguistics, or text mining applied to opinion analysis [1].
- Businesses and developers: Those aiming to build applications or services that require sentiment understanding of Spanish-language reviews, particularly in the hospitality sector [1].
Dataset Name Suggestions
- Andalusian Hotel Reviews for NLP
- Spanish Hotel Sentiment Corpus
- TripAdvisor Andalusian Reviews Dataset
- Spanish Hotel Review Sentiment Data
- Andalusian Hotel Opinions Dataset
Attributes
Original Data Source: Andalusian Hotels’ Reviews