Global LLM Release Metrics
Data Science and Analytics
Free
About
Detailed historical data on every major Large Language Model (LLM) and chatbot released between 2018 and 2024. The dataset provides essential technical specifications critical for understanding the development, growth, and complexity of modern artificial intelligence. It serves as a resource for tracking industry evolution by detailing model parameters, token counts, training data, and associated companies.
Columns
The dataset contains 11 specific fields detailing model information:
- Model: The official designation or name of the language model.
- Company: The corporation or entity responsible for developing the model. Google and Meta AI are frequently represented.
- Arch: Describes the underlying model architecture, such as Transformer or Recurrent Neural Network (RNN). Some values are designated as To Be Announced (TBA).
- Parameters: The measure of the model's complexity, expressed in billions of weights.
- Tokens: The volume of sub-word units the model was trained on or can process, recorded in billions. Roughly 25% of records are missing this value.
- Ratio: Typically the parameter-to-token ratio; this field is sparsely populated (e.g., 20:1 for Olympus).
- ALScore: A calculated metric intended as a quick rating of a model's power, computed as the square root of Parameters multiplied by Tokens (see the sketch after this list).
- Training dataset: The primary data sources used to train the model, often listing resources like Wikipedia, books, and common crawl data.
- Release Date: The anticipated or confirmed date when the model was made available.
- Notes: Provides supplementary details, such as whether the model functions as a Chatbot.
- Playground: A URL linking to a site where users can interact with the model or find additional information.
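As a minimal illustration of how the ALScore described above could be reproduced, the sketch below takes the square root of Parameters multiplied by Tokens, both in billions. Any additional normalisation the dataset may apply is not stated in the description, so the function name and example values are assumptions.

import math

def alscore(parameters_b: float, tokens_b: float) -> float:
    # Quick power rating as described above: sqrt(Parameters x Tokens),
    # with both quantities expressed in billions. Any further scaling
    # the dataset applies is unknown and omitted here.
    return math.sqrt(parameters_b * tokens_b)

# Hypothetical example: a 70B-parameter model trained on 2,000B tokens
print(round(alscore(70, 2000), 2))  # 374.17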
Distribution
The data is distributed as a single CSV file named "Large language models (2024).csv", containing 342 records across 11 columns. Updates are expected on a quarterly basis. Note that certain fields, such as Tokens, Ratio, and ALScore, have substantial portions of missing values.
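As a hedged starting point, the sketch below loads the CSV named above with pandas and reports the share of missing values in the sparsest columns. The column names follow the Columns section; the exact headers in the file may differ.

import pandas as pd

# Load the distributed file (name as given above; adjust the path if needed).
df = pd.read_csv("Large language models (2024).csv")

print(df.shape)  # expected to be (342, 11) per the description

# Share of missing values in the columns noted as incomplete
print(df[["Tokens", "Ratio", "ALScore"]].isna().mean())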
Usage
This data product is ideally suited for benchmarking and comparing Large Language Models across various metrics like size, complexity, and training data source. It is perfect for tracking trends in AI development over the period 2018 to 2024, identifying dominant architectures like Dense models, and performing historical analysis of corporate involvement in AI. It is highly useful for generating industry reports and visualisations showing the growth curve of LLM capabilities.
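A simple trend analysis of model size over the 2018 to 2024 period might look like the sketch below. The column names are taken from the Columns section, and the parsing assumes Parameters is stored as a plain number of billions; both are assumptions about the raw file.

import pandas as pd

df = pd.read_csv("Large language models (2024).csv")

# Parse dates and sizes defensively; formats in the raw file are assumptions.
df["Release Date"] = pd.to_datetime(df["Release Date"], errors="coerce")
df["Parameters"] = pd.to_numeric(df["Parameters"], errors="coerce")

# Median parameter count (in billions) per release year, 2018-2024
trend = df.groupby(df["Release Date"].dt.year)["Parameters"].median()
print(trend)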
Coverage
The temporal scope of the data encompasses the years 2018 through 2024, covering releases within that period. The data pertains to globally released major models and chatbots. The scope focuses purely on technological specifications and corporate development entities rather than geographic or demographic variables.
License
CC0: Public Domain
Who Can Use It
The dataset holds high value (rated 10.00 for usability) for several key user groups:
- Artificial Intelligence Researchers: For studying the relationship between parameter count, token volume, and model release date (a starting-point sketch follows this list).
- Data Scientists: For advanced modelling and predictive analysis related to AI growth trajectories.
- Academics and Students: For educational purposes, specifically understanding LLM taxonomy and architecture types.
- Industry Analysts: For tracking the activities of key companies like Google and Meta AI in the LLM space.
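To make the first of these uses concrete, the sketch below computes a simple correlation between model size and training-token volume. It is only a starting point; the column names and numeric formats are assumptions carried over from the Columns section.

import pandas as pd

df = pd.read_csv("Large language models (2024).csv")
for col in ["Parameters", "Tokens"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Pairwise correlation between model size and training-token volume,
# ignoring records where either value is missing.
print(df[["Parameters", "Tokens"]].corr())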
Dataset Name Suggestions
- AI Language Model Specifications 2018-2024
- Global LLM Release Metrics
- Major Chatbot and LLM Technical History
Attributes
Original Data Source: Global LLM Release Metrics