Opendatabay APP

American English Genre and Rhetoric Analysis

Data Science and Analytics

Tags and Keywords

Corpus

English

Linguistics

Academic

Metadata

Free

About

This dataset provides metadata and linguistic feature analysis derived from the widely used Corpus of Contemporary American English (COCA). It supports the study of English language evolution, genre distinctions, and rhetorical strategies over a thirty-year period. By quantifying elements such as expert vocabulary, citation styles, and confidence markers, it supports research in linguistics, education, and natural language processing.

Columns

  • text_id: Unique identifier for the text segment.
  • text_type: The primary category of the text (e.g., ACAD for Academic).
  • sub_type: Sub-categorisation of the text genre (e.g., Arg for Argumentative).
  • year: The year of publication, ranging from 1990 to 2019.
  • word_count: The total number of words in the specific text segment.
  • Character Types: Metric indicating the diversity or nature of characters used.
  • Citation Authority: Score measuring reliance on authoritative citations.
  • Citation Controversy: Metric assessing the presence of controversial citations.
  • Citation Hedged: Quantification of hedged or cautious citation usage.
  • Citation Neutral: Measure of neutral citation phrasing.
  • Confidence Hedged: Metric indicating tentative or hedged expressions of confidence.
  • Confidence High: Score reflecting high-confidence language.
  • Confidence Low: Score reflecting low-confidence or uncertain language.
  • Expert Vocabulary: Quantification of specialised or domain-specific terminology.
  • Information Comparison: Metric evaluating comparative information structures.
  • Information Quantities: Measurement of quantitative information presentation.
  • Methods Results Discussion: Score related to the structural elements of academic discourse (methods, results, discussion).
  • Purpose Plan: Metric assessing the articulation of purpose or planning in the text.
  • Reader Directed Metadiscourse: Measure of language explicitly guiding the reader.
  • Reader Directed Metadiscourse FP: Specific metric for reader-directed metadiscourse (First Person focus).
  • Reasoning: Quantifiable score of reasoning or argumentative logic present in the text.
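
The column list above can be turned into a small loader. The following is a minimal sketch, not the publisher's tooling: it assumes the CSV header uses exactly the names listed above, that `year` and `word_count` are integers, and that the sixteen feature columns hold numeric scores. The helper name `load_rows` and the sample row are made up for illustration.

```python
import csv
import io

# Column names as listed above; the exact header spelling in the CSV is an assumption.
ID_COLUMNS = ["text_id", "text_type", "sub_type"]
INT_COLUMNS = ["year", "word_count"]
FEATURE_COLUMNS = [
    "Character Types", "Citation Authority", "Citation Controversy",
    "Citation Hedged", "Citation Neutral", "Confidence Hedged",
    "Confidence High", "Confidence Low", "Expert Vocabulary",
    "Information Comparison", "Information Quantities",
    "Methods Results Discussion", "Purpose Plan",
    "Reader Directed Metadiscourse", "Reader Directed Metadiscourse FP",
    "Reasoning",
]
ALL_COLUMNS = ID_COLUMNS + INT_COLUMNS + FEATURE_COLUMNS  # 21 columns in total


def load_rows(handle):
    """Read CSV rows, converting year/word_count to int and features to float."""
    reader = csv.DictReader(handle)
    missing = [c for c in ALL_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for raw in reader:
        row = {c: raw[c] for c in ID_COLUMNS}
        row.update({c: int(raw[c]) for c in INT_COLUMNS})
        row.update({c: float(raw[c]) for c in FEATURE_COLUMNS})
        rows.append(row)
    return rows


# Tiny made-up sample so the sketch runs without the real file.
sample = io.StringIO(
    ",".join(ALL_COLUMNS) + "\n"
    + ",".join(["t1", "ACAD", "Arg", "1990", "250"] + ["0.5"] * 16) + "\n"
)
rows = load_rows(sample)
print(rows[0]["text_type"], rows[0]["year"], rows[0]["Reasoning"])
```

To read the real file, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) in place of `sample`. If the actual header differs from the names above, the loader raises a `ValueError` naming the missing columns.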

Distribution

  • Format: Tabular data (CSV).
  • Size: 4.77 MB.
  • Structure: 42,700 valid records with 21 columns.
  • Data Integrity: 100% valid entries with no missing or mismatched data found in the sample analysis.
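
The figures above (record count, column count, no missing or mismatched values) can be re-checked after download. This is a rough sketch under the assumption that a "missing" value appears as an empty cell and a "mismatched" row is one whose field count differs from the header; the function name `integrity_report` is hypothetical, and the three-column sample is made up.

```python
import csv
import io


def integrity_report(handle, expected_columns=21):
    """Count records, rows with a wrong field count, and rows with empty cells."""
    reader = csv.reader(handle)
    header = next(reader)
    report = {"columns": len(header), "records": 0, "mismatched": 0, "with_missing": 0}
    for fields in reader:
        report["records"] += 1
        if len(fields) != len(header):
            report["mismatched"] += 1  # row is shorter or longer than the header
        elif any(cell.strip() == "" for cell in fields):
            report["with_missing"] += 1  # at least one empty cell
    report["column_count_ok"] = report["columns"] == expected_columns
    return report


# Made-up three-column sample: one clean row, one with an empty cell, one too short.
sample = io.StringIO("id,a,b\nx,1,2\ny,,3\nz,4\n")
report = integrity_report(sample, expected_columns=3)
print(report)
```

On the real CSV, a result consistent with the listing would show 21 columns, 42,700 records, and zero mismatched or incomplete rows.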

Usage

  • Linguistic Research: Analysing shifts in American English vocabulary and grammar over time.
  • Academic Writing Studies: Investigating rhetorical moves (citations, hedging, confidence) across different academic disciplines.
  • Genre Analysis: Comparing stylistic features between fiction, news, and academic texts.
  • Natural Language Processing: Training models to recognise genre-specific attributes or sentiment confidence.
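
As one illustration of the genre and diachronic use cases above, a feature can be averaged per genre and decade. This is a sketch only: it assumes the feature values are numeric, the genre code `FIC` is a hypothetical placeholder (the listing only confirms `ACAD`), and the sample rows are invented rather than taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_by_genre_and_decade(rows, feature):
    """Average one feature for each (text_type, decade) pair."""
    groups = defaultdict(list)
    for row in rows:
        decade = (int(row["year"]) // 10) * 10  # 1991 -> 1990, 2015 -> 2010
        groups[(row["text_type"], decade)].append(float(row[feature]))
    return {key: mean(values) for key, values in sorted(groups.items())}


# Made-up rows for illustration only; real values come from the CSV.
rows = [
    {"text_type": "ACAD", "year": 1991, "Confidence Hedged": 2.0},
    {"text_type": "ACAD", "year": 1998, "Confidence Hedged": 4.0},
    {"text_type": "ACAD", "year": 2015, "Confidence Hedged": 5.0},
    {"text_type": "FIC", "year": 2015, "Confidence Hedged": 1.0},
]
result = mean_by_genre_and_decade(rows, "Confidence Hedged")
for (genre, decade), value in result.items():
    print(genre, decade, value)
```

Grouping by decade rather than by single year keeps each group reasonably large. Whether the scores are comparable across texts of different lengths is not stated in the listing, so normalising by `word_count` may be worth considering.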

Coverage

  • Geographic Scope: United States (American English).
  • Time Range: 1990 to 2019.
  • Genres: Spoken, Fiction, Popular Magazines, Newspapers, Academic Texts, TV/Movies, Blogs, Web Pages.

License

CC0: Public Domain

Who Can Use It

  • Linguists: For corpus linguistics and diachronic studies.
  • Educators: Developing materials based on real-world usage of academic or spoken English.
  • Data Scientists: For text classification and NLP feature engineering.
  • Students: Conducting beginner to advanced research in English language studies.

Dataset Name Suggestions

  • COCA Linguistic Metrics and Metadata 1990-2019
  • American English Genre and Rhetoric Analysis
  • Corpus of Contemporary American English Feature Set
  • 30-Year English Language Variation Metrics

Attributes

Listing Stats

VIEWS

1

DOWNLOADS

0

LISTED

07/12/2025

REGION

GLOBAL

UNIVERSAL DATA QUALITY SCORE (UDQS)

5 / 5

VERSION

1.0
