Back to datasets
Dataset assetOpen Source CommunityText GenerationWeather Data
Weather Captioned Dataset
Weather Captioned Dataset is a multimodal time‑series text dataset that includes weather data from the Max‑Planck‑Institute for Biogeochemistry, Jena, and forecast reports from publicly available weather forecasting platforms. The dataset also provides GPT‑4‑generated descriptions and pre‑embedded news texts.
Source
github
Created
Jul 26, 2024
Updated
Jul 26, 2024
Signals
118 views
Availability
Linked source ready
Overview
Dataset description and usage context
Weather Captioned – First Time Series – Text Multimodal Dataset
Data Sources
- Time‑series data come from the Max‑Planck‑Institute for Biogeochemistry, Jena WS Beutenberg station.
- Weather forecast reports are obtained from publicly available forecasting platforms.
About the Descriptions
- Descriptions are generated from raw data sourced from public weather forecast services.
- No time‑series data were provided to the large language model.
- Descriptions were generated by GPT‑4; the generation script is located at
data_process_scripts/weather_caption.py. - The total cost of generating the dataset descriptions may exceed $400.
- Users are encouraged to generate their own descriptions.
About the Pre‑Embeddings
- Pre‑embedded news texts are provided.
- Embeddings can be downloaded here.
- The embedding generation scripts are located at
data_process_scripts/embedding_caption_local.ipynbanddata_process_scripts/embedding_caption.ipynb.
Data Processing Workflow
- Two hash tables manage news data to align it chronologically with the time series.
- News text embeddings are saved as
.npyfiles, with file names serving as hash keys. - Time‑stamped series segments are converted to a list of news hash keys via the
Date2Hashtable, then embeddings are read via theHash2Embtable. - The
Hash2Texttable can be used to inspect the news text corresponding to a hash key.
Need downstream help?
Pair the dataset with AI analysis and content workflows.
Once the source passes your review, move straight into summarization, transformation, report drafting, or presentation generation with the JuheAI toolchain.