Contextual Language Comprehension Dataset
Data Science and Analytics
Free
About
This dataset, known as HellaSwag (Commonsense NLI), evaluates a machine's ability to complete sentences in a logically coherent way. It provides over 10,000 sentence completion examples, each consisting of an initial sentence segment followed by four candidate endings. The task for an AI system is to select the ending that best completes the segment. Humans typically find this straightforward because they draw on common sense and an intuitive grasp of language; machines find it hard because the correct answer depends on meaning and context, not on word recognition alone. HellaSwag therefore serves as a benchmark for assessing current machine capabilities in language comprehension and generation, and for highlighting where further progress is needed.
Columns
The dataset typically includes the following columns:
- ind: An integer representing the index of the sentence.
- activity_label: A string indicating the label of the activity.
- ctx_a: A string containing the first context sentence.
- ctx_b: A string containing the second context sentence.
- endings: A string holding the four candidate endings for the sentence.
- split: A string denoting the division of the dataset (e.g., training or test set).
- split_type: A string specifying the type of split, such as 'indomain' or 'zeroshot'.
- label: The label indicating which of the possible endings is the correct one for the sentence completion.
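The column layout above can be illustrated with a minimal loading sketch. It rests on assumptions not confirmed by this listing: that `endings` is serialised as a Python-style list literal, that `label` is a zero-based integer index into that list, and that `ctx_a` and `ctx_b` are meant to be joined into one prompt. The sample rows and the `parse_row` helper are hypothetical.

```python
import ast
import csv
import io

# Hypothetical two-row sample; the real file's serialisation may differ.
SAMPLE_CSV = '''ind,activity_label,ctx_a,ctx_b,endings,split,split_type,label
0,Making tea,A woman fills a kettle.,She,"['switches it on.', 'throws it away.', 'paints it blue.', 'eats it.']",train,indomain,0
1,Washing a car,A man picks up a hose.,He,"['reads a book.', 'sprays the car.', 'falls asleep.', 'sings to it.']",train,zeroshot,1
'''


def parse_row(row):
    """Convert one raw CSV row (all strings) into typed fields."""
    return {
        "ind": int(row["ind"]),
        "activity_label": row["activity_label"],
        # Join the two context parts into a single prompt.
        "context": f'{row["ctx_a"]} {row["ctx_b"]}'.strip(),
        # Assumption: endings is stored as a Python-style list literal.
        "endings": ast.literal_eval(row["endings"]),
        "split": row["split"],
        "split_type": row["split_type"],
        # Assumption: label is a zero-based index into endings.
        "label": int(row["label"]),
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
correct_ending = rows[0]["endings"][rows[0]["label"]]
```

If the actual file stores `endings` in another form (for example a delimiter-separated string), only the `ast.literal_eval` line would need to change.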
Distribution
The dataset is distributed as a data file, commonly in CSV format. It comprises over 10,000 sentence completion examples, each with context sentences and multiple-choice endings; an exact row count is not provided. The data can be split into training and test sets for model development, for instance using an 80/20 ratio. The 'split' column records which division each example belongs to, while the 'split_type' column categorises examples as 'indomain' or 'zeroshot', with each type accounting for 50% of the examples.
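An 80/20 split of this kind might be sketched as follows, using only the standard library. The `train_test_split` helper and the stand-in records are hypothetical; the records mimic only the `ind` and `split_type` columns, not the real data.

```python
import random
from collections import Counter


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and cut off a test portion."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in records, alternating between the two split types.
examples = [
    {"ind": i, "split_type": "indomain" if i % 2 == 0 else "zeroshot"}
    for i in range(10)
]

train, test = train_test_split(examples)

# Tally how many examples fall under each split type.
split_type_counts = Counter(e["split_type"] for e in examples)
```

Shuffling before cutting avoids inheriting any ordering present in the file. When zero-shot generalisation is the point of the evaluation, filtering on `split_type` before splitting may be more appropriate than a purely random cut.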
Usage
This dataset is ideally suited for various machine learning and natural language processing applications, including:
- Training models to generate novel sentence endings that mimic human-like creativity and coherence.
- Developing models that enhance their understanding of sentence context, enabling them to select the most appropriate ending based on the given context.
- Building models capable of evaluating two sentences with different endings and determining which one is more probable, drawing upon common-sense knowledge.
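The selection task in the applications above can be sketched as scoring each candidate ending against the context and picking the highest scorer. The word-overlap scorer here is a deliberately naive placeholder, not a method associated with the dataset; in practice the `score` argument would be a trained model's estimate of how plausible an ending is. All names and the example record are hypothetical, and `label` is assumed to be a zero-based index.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def overlap_score(context, ending):
    """Toy scorer: count ending words that also occur in the context."""
    context_words = set(tokenize(context))
    return sum(word in context_words for word in tokenize(ending))


def choose_ending(context, endings, score=overlap_score):
    """Return the index of the highest-scoring ending (first wins on ties)."""
    scores = [score(context, ending) for ending in endings]
    return max(range(len(endings)), key=scores.__getitem__)


def accuracy(examples, score=overlap_score):
    """Fraction of examples where the chosen ending matches the label."""
    correct = sum(
        choose_ending(ex["context"], ex["endings"], score) == ex["label"]
        for ex in examples
    )
    return correct / len(examples)


# Hypothetical example record in the shape used by this sketch.
examples = [
    {
        "context": "A man picks up a hose and walks to the car. He",
        "endings": [
            "reads a book.",
            "sprays the car with the hose.",
            "falls asleep.",
            "sings loudly.",
        ],
        "label": 1,
    }
]

predicted = choose_ending(examples[0]["context"], examples[0]["endings"])
```

Keeping the scorer as a parameter means the same `choose_ending` and `accuracy` functions can be reused unchanged when the placeholder is swapped for a real model.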
Coverage
The dataset is listed with a GLOBAL region scope. No specific geographical, temporal, or demographic coverage details regarding the content of the data itself are provided in the available information. The listing date for the dataset is noted as 17/06/2025.
License
CC0
Who Can Use It
This dataset is invaluable for:
- Data scientists and machine learning engineers working on natural language understanding and generation tasks.
- AI researchers focused on advancing the ability of artificial intelligence systems to interact and communicate in a more human-like way.
- Anyone involved in building models for sentence completion, contextual reasoning, and common-sense knowledge integration in AI.
Dataset Name Suggestions
- HellaSwag (Commonsense NLI)
- AI Sentence Completion Challenge
- Contextual Language Comprehension Dataset
- Commonsense Language Understanding Benchmark
Attributes
Original Data Source: HellaSwag (Commonsense NLI)