Weather Captioned Dataset

Weather Captioned Dataset is a multimodal dataset that pairs time series with text. It combines weather data from the Max Planck Institute for Biogeochemistry, Jena, with forecast reports from publicly available weather forecasting platforms. The dataset also provides GPT‑4‑generated descriptions and pre‑embedded news texts.

Updated 7/26/2024

Description

Weather Captioned: First Time‑Series and Text Multimodal Dataset

Data Sources

About the Descriptions

  • Descriptions were generated from raw data sourced from public weather forecast services.
  • No time‑series data were provided to the large language model.
  • GPT‑4 produced the descriptions; the generation script is located at data_process_scripts/weather_caption.py.
  • The total cost of generating the dataset descriptions may exceed $400.
  • Users are encouraged to generate their own descriptions.

About the Pre‑Embeddings

  • Pre‑embedded news texts are provided.
  • Embeddings can be downloaded here.
  • The embedding generation scripts are located at data_process_scripts/embedding_caption_local.ipynb and data_process_scripts/embedding_caption.ipynb.

Data Processing Workflow

  • Two hash tables manage news data to align it chronologically with the time series.
  • News text embeddings are saved as .npy files, with file names serving as hash keys.
  • Time‑stamped series segments are converted to a list of news hash keys via the Date2Hash table, then embeddings are read via the Hash2Emb table.
  • The Hash2Text table can be used to inspect the news text corresponding to a hash key.
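
The lookup chain described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's actual code: only the table names (Date2Hash, Hash2Emb, Hash2Text) come from the description, while the dictionary layout, the date format, the hash keys, the sample texts, the `embeddings` directory, and the function names are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# All table contents are invented for illustration; only the table names
# (Date2Hash, Hash2Emb, Hash2Text) come from the dataset description.
date2hash: Dict[str, List[str]] = {
    "2014-01-01": ["a1b2"],
    "2014-01-02": ["c3d4", "e5f6"],
}
hash2text: Dict[str, str] = {
    "a1b2": "Cloudy with light rain in the morning.",
    "c3d4": "Clear skies, temperatures near freezing.",
    "e5f6": "Strong westerly winds expected overnight.",
}

# Assumed layout: one .npy file per text, with the hash key as the file name.
EMB_DIR = Path("embeddings")  # hypothetical directory
hash2emb: Dict[str, Path] = {key: EMB_DIR / f"{key}.npy" for key in hash2text}


def load_npy(path: Path):
    """Default loader: read one embedding from a .npy file (needs NumPy)."""
    import numpy as np  # imported lazily so the lookup logic runs without it
    return np.load(path)


def embeddings_for_segment(
    timestamps: List[str],
    loader: Callable[[Path], object] = load_npy,
) -> List[Tuple[str, str, object]]:
    """Resolve timestamps to hash keys (Date2Hash), then to embeddings (Hash2Emb)."""
    rows = []
    for ts in timestamps:
        # Timestamps with no associated text simply contribute nothing.
        for key in date2hash.get(ts, []):
            rows.append((ts, key, loader(hash2emb[key])))
    return rows


# Stand-in loader that returns the file name, so the sketch runs without
# any embedding files on disk; drop the argument to read real .npy files.
rows = embeddings_for_segment(["2014-01-02", "2014-01-03"], loader=lambda p: p.name)
for ts, key, emb in rows:
    # Hash2Text lets us inspect the text behind each hash key.
    print(ts, key, emb, "|", hash2text[key])
```

Keeping the loader as a parameter is a design choice for this sketch only: it lets the date-to-hash-to-embedding resolution be checked separately from file I/O.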

Topics

Weather Data
Text Generation

Source

Organization: github

Created: 7/26/2024