Weather Captioned Dataset
Weather Captioned Dataset is a multimodal dataset that pairs time series with text. It includes weather data from the Max‑Planck‑Institute for Biogeochemistry, Jena, and forecast reports from publicly available weather forecasting platforms. The dataset also provides GPT‑4‑generated descriptions and pre‑embedded news texts.
Updated 7/26/2024
Description
Weather Captioned: First Time‑Series and Text Multimodal Dataset
Data Sources
- Time‑series data come from the Max‑Planck‑Institute for Biogeochemistry, Jena WS Beutenberg station.
- Weather forecast reports are obtained from publicly available forecasting platforms.
About the Descriptions
- Descriptions are generated from raw data sourced from public weather forecast services.
- No time‑series data were provided to the large language model.
- Descriptions were generated by GPT‑4; the generation script is located at data_process_scripts/weather_caption.py.
- The total cost of generating the dataset descriptions may exceed $400.
- Users are encouraged to generate their own descriptions.
About the Pre‑Embeddings
- Pre‑embedded news texts are provided.
- Embeddings can be downloaded here.
- The embedding generation scripts are located at data_process_scripts/embedding_caption_local.ipynb and data_process_scripts/embedding_caption.ipynb.
Data Processing Workflow
- Two hash tables manage news data to align it chronologically with the time series.
- News text embeddings are saved as .npy files, with file names serving as hash keys.
- Time‑stamped series segments are converted to a list of news hash keys via the Date2Hash table, then embeddings are read via the Hash2Emb table.
- The Hash2Text table can be used to inspect the news text corresponding to a hash key.
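The lookup chain described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: the key scheme (a truncated SHA-256 of the text), the function names text_key, toy_embed, build_tables and embeddings_for, the date string format, and the toy embedding are all hypothetical. Hash2Emb is modeled here simply as a directory of .npy files named by hash key.

```python
import hashlib
import tempfile
from pathlib import Path

import numpy as np


def text_key(text: str) -> str:
    """Hypothetical hash key: first 16 hex characters of the text's SHA-256."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def toy_embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (assumption): two simple counts."""
    return np.array([len(text), text.count(" ")], dtype=np.float32)


def build_tables(news, embed, emb_dir: Path):
    """Build Date2Hash and Hash2Text, and save each embedding as <key>.npy.

    `news` is an iterable of (date_string, text) pairs.
    """
    date2hash, hash2text = {}, {}
    for date, text in news:
        key = text_key(text)
        date2hash.setdefault(date, []).append(key)  # Date2Hash: date -> keys
        hash2text[key] = text                       # Hash2Text: key -> text
        np.save(emb_dir / f"{key}.npy", embed(text))  # file name is the key
    return date2hash, hash2text


def embeddings_for(dates, date2hash, emb_dir: Path):
    """Map a segment's dates to hash keys, then load the matching .npy files."""
    keys = [key for date in dates for key in date2hash.get(date, [])]
    return keys, [np.load(emb_dir / f"{key}.npy") for key in keys]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        emb_dir = Path(tmp)
        news = [
            ("2020-01-01", "Light rain expected."),
            ("2020-01-02", "Clear skies."),
        ]
        date2hash, hash2text = build_tables(news, toy_embed, emb_dir)
        keys, embs = embeddings_for(["2020-01-01", "2020-01-02"], date2hash, emb_dir)
        for key, emb in zip(keys, embs):
            print(key, hash2text[key], emb.shape)
```

Keeping the embeddings on disk and only the keys in memory means a time-series segment can carry a short list of keys, with the vectors loaded on demand. Dates with no news entry simply contribute no keys.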
Topics
Weather Data
Text Generation
Source
Organization: github
Created: 7/26/2024