Dataset asset | Open Source Community | Text Generation | Weather Data

Weather Captioned Dataset

Weather Captioned Dataset is a multimodal dataset that pairs time series with text. It combines weather data from the Max Planck Institute for Biogeochemistry, Jena, with forecast reports from publicly available weather forecasting platforms. The dataset also provides GPT-4-generated descriptions and pre-embedded news texts.

Source: github
Created: Jul 26, 2024
Updated: Jul 26, 2024

Weather Captioned: First Time-Series-Text Multimodal Dataset

Data Sources

About the Descriptions

  • Descriptions are generated from raw data sourced from public weather forecast services.
  • No time‑series data were provided to the large language model.
  • Descriptions were generated by GPT‑4; the generation script is located at data_process_scripts/weather_caption.py.
  • The total cost of generating the dataset descriptions may exceed $400.
  • Users are encouraged to generate their own descriptions.

About the Pre‑Embeddings

  • Pre‑embedded news texts are provided.
  • Embeddings can be downloaded here.
  • The embedding generation scripts are located at data_process_scripts/embedding_caption_local.ipynb and data_process_scripts/embedding_caption.ipynb.

Data Processing Workflow

  • Two hash tables manage news data to align it chronologically with the time series.
  • News text embeddings are saved as .npy files, with file names serving as hash keys.
  • Each time-stamped series segment is mapped to a list of news hash keys via the Date2Hash table; the corresponding embeddings are then read via the Hash2Emb table.
  • The Hash2Text table can be used to inspect the news text corresponding to a hash key.
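
The lookup chain described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names, the use of ISO date strings as Date2Hash keys, and the `<hash key>.npy` file layout are all assumptions made for the example. The Hash2Emb step is passed in as a callable so the alignment logic can be tried without the real embedding files.

```python
from datetime import date, timedelta
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Hypothetical in-memory stand-ins for the tables named in the workflow.
# The key formats are assumptions, not the dataset's documented schema.
Date2Hash = Dict[str, List[str]]  # ISO date string -> news hash keys for that day
Hash2Text = Dict[str, str]        # news hash key -> original news text


def dates_in_window(start: date, end: date) -> Iterable[str]:
    """Yield ISO date strings from start to end, inclusive."""
    day = start
    while day <= end:
        yield day.isoformat()
        day += timedelta(days=1)


def hashes_for_window(start: date, end: date, date2hash: Date2Hash) -> List[str]:
    """Collect news hash keys for a series segment, in chronological order."""
    keys: List[str] = []
    for day in dates_in_window(start, end):
        keys.extend(date2hash.get(day, []))  # days without news contribute nothing
    return keys


def npy_reader(emb_dir: Path) -> Callable[[str], object]:
    """Build a Hash2Emb reader, assuming each file is named '<hash key>.npy'."""
    def read(key: str):
        import numpy as np  # imported lazily; only needed for real .npy files
        return np.load(emb_dir / f"{key}.npy")
    return read


def embeddings_for_window(start: date, end: date, date2hash: Date2Hash,
                          hash2emb: Callable[[str], object]) -> list:
    """Date2Hash lookup followed by a Hash2Emb read for every key found."""
    return [hash2emb(key) for key in hashes_for_window(start, end, date2hash)]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    date2hash = {"2020-01-01": ["a1"], "2020-01-03": ["b2", "c3"]}
    hash2text = {"a1": "Cold front arrives.", "b2": "Heavy rain.", "c3": "Skies clear."}
    for key in hashes_for_window(date(2020, 1, 1), date(2020, 1, 3), date2hash):
        print(key, "->", hash2text[key])  # inspect the text behind each hash key
```

With the real files, `npy_reader(Path(...))` pointed at the downloaded embedding directory would be passed as `hash2emb`. Keeping the reader separate from the date lookup also makes it easy to substitute a cache or a different storage format.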