
Sustax Automation

Last update: August 27, 2025

Interactive Index

  1. The CSV Output File
  2. Import Data with Microsoft Excel
  3. Import Data with Python
  4. Import Data with R
  5. Importing and Interpreting with Large Language Models
  6. Sustax User’s API

The CSV Output File

Each Sustax request returns one CSV file per gridcell or POI (Point of Interest), encoded in UTF-8, with fields separated by commas. If the user asks for monthly and daily data in the same request, two CSV files are returned. Each CSV consists of three main structures:

(Figure) Visualization of the generated CSV file for monthly data

1. Metadata (free of cost)

Includes the CSV creation date, a precise location in geographical coordinates (WGS84), an approximate OpenStreetMap location, the climate variables requested (short name, long name and units), the accuracy metrics requested (short name and long name) and the scenarios requested (SSPs or ERA5). Each request also includes, free of cost, a set of static variables representing the interpreted geophysical features of the gridcell:

  • CVH. Estimated high cover vegetation fraction (%)
  • CVL. Estimated low cover vegetation fraction (%)
  • TVH. Type of high vegetation (~)
  • TVL. Type of low vegetation (~)
  • SLT. Soil type (~)
  • Z. Estimated geopotential height (m)
  • SDOR. Standard deviation of sub-gridscale orography (~)
  • SLOR. Slope of sub-gridscale orography (~)

2. Accuracy metrics (payload)

The predictive accuracy metrics have a key characteristic: they are timeless, meaning each of the five metrics has a single value per climate variable and per SSP-RCP scenario.

3. Climate Data (payload) 

The climate data consist of time series at daily or monthly resolution. For each climate scenario and variable, a time series is generated that can contain up to 102 years of data. Since users can request data with monthly or daily time stamps, two separate CSVs are generated if the user's request includes datasets with both temporal resolutions.

If the user requests an ROI (Region of Interest) instead of a POI (Point of Interest), they will receive multiple CSVs, one for each POI contained within the requested ROI. In any case, the requested CSV(s) are delivered in a compressed ZIP file.
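
For orientation, the sketch below shows the approximate block layout of a monthly Sustax CSV. It is a schematic inferred from the loader functions later on this page: the marker rows (Dataset:, Metrics requested:, Data requested:) are real, while the variable names, scenarios and values shown are hypothetical placeholders.


Creation date:,2025/08/27
Longitude:,34.841782
Latitude:,-19.828707                       <- metadata block (free of cost)
Soil variables:,CVH,CVL,TVH,...
...
Dataset:,Monthly data
Metrics requested:,
...
MBE - MxMT,0.12,0.15,...                   <- one row per metric-variable pair,
...                                           one column per scenario
Data requested:,
MxMT - Maximum Mean Temperature,,...       <- variable names
Time,SSP5-8.5,ERA5,...                     <- scenarios
,degC,degC,...                             <- units
2015/01,31.2,30.9,...                      <- time series payload
2015/02,31.4,31.0,...                         (daily files use YYYY/MM/DD stamps)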

Import Data with Microsoft Excel

As Sustax's CSVs are encoded in UTF-8 with BOM, Excel usually detects the encoding automatically.

Desktop Excel 

  • Open Excel (desktop version).
  • Go to Data → Get Data → From Text/CSV.
  • Select your CSV file and click Import.
  • In the Import Wizard:
    • File Origin: Choose 65001: Unicode (UTF-8).
    • Delimiter: Select Comma (,).
  • Click Load (or Transform Data if you want to edit in Power Query).
  • Your data will appear in separate columns.

Online Excel 

  • Upload the CSV to OneDrive.
  • Open it in Excel Online.
  • If Excel detects the delimiter automatically, it will split columns.
  • If not, select the entire Text Column:
    • Click on Data tab
    • In Data Tools, click Split Text to Columns and select Comma (,) as the delimiter.

If you see strange characters in the metadata (e.g., ñ, accents, foreign ANSI characters), ensure you selected UTF-8 in the import step. If your Excel locale does not use the dot '.' as the decimal separator (e.g., Spanish Excel uses the comma ','), the data will be displayed incorrectly.

Import Data with Python

The script requires Python 3.6 or above with two external libraries:

  • Numpy 
  • Pandas 

The script contains a single function with all the imports necessary inside. 


def load_sustax_file(csv_stx, return_pandas_df = True, return_metadata = False):
    """
    Load Sustax CSV data into Python as NumPy arrays or as pandas dataframes

    Parameters:
    ----------
        - csv_stx: Str. or Path obj. Full file name pointing to Sustax CSV
        - return_pandas_df: Bool. Use bool(True) to get outputs as Pandas obj.
        - return_metadata: Bool. Use bool(True) to get CSV metadata

    Returns:
    -------
        Returns two, three or four objects depending on `return_pandas_df` and
        `return_metadata` parameters:
            1. Climate data: All `Data requested` field from the CSV file
            2. Accuracy metrics: All `Metrics requested` field from the CSV file
            3. Climate data Time Stamps: All timesteps in `Data requested` field
            4. Metadata: Dictionary object containing all the CSV metadata
        The climate data and accuracy metrics can be returned as dictionaries
        with arrays or as Pandas dataframe
        If `return_pandas_df` is requested, Time Stamps are merged with the
        climate data automatically

    External packages required:
    --------------------------
        - Numpy
        - Pandas
    """

    # Imports "inside" for compactness. We encourage you to place all imports
    # at the beginning of your module / script file
    import csv
    import decimal
    import numpy as np
    import pandas as pd
    from datetime import datetime
    def _isfloat(string):
        try:
            decimal.Decimal(string)
            return True
        except decimal.InvalidOperation:
            return False
   
    # Sustax CSVs are UTF-8 with BOM; 'utf-8-sig' strips the BOM if present
    with open(csv_stx, 'r', encoding = 'utf-8-sig') as fobj:
        data = csv.reader(fobj, delimiter = ',')
        data = [i for i in data]
    if return_metadata:
        mt = {}
        # The creation date sits on the second row of the file
        mt["creation_date"] = data[1][1]
        for r in data:
            if any(['longitude' in h.lower() for h in r]):
                mt["lon"] = [float(rr) for rr in r if _isfloat(rr)][0]
            if any(['latitude' in h.lower() for h in r]):
                mt["lat"] = [float(rr) for rr in r if _isfloat(rr)][0]
            if any(['soil variables' in h.lower() for h in r]):
                idx_loc = [c for c in range(len(data)) if 'Soil variables:' in data[c]][0]
                mt["soil_data"] = {}
                for c in range(1, len(r)):
                    if r[c]:
                        mt["soil_data"].update({data[idx_loc][c]:data[idx_loc+1][c]})
       
    # Check if daily or monthly dataset, get the whole string
    dts_type = [d for d in data if (len(d) > 0) and (d[0] == 'Dataset:')][0][1]
    # Get climate data
    # Four rows below to get the metric values
    idx_data = [c for c in range(len(data))\
                if 'Data requested:' in data[c]][0] + 4
    all_data = data[idx_data:]
    all_data_vars = data[idx_data-3]
    all_data_scenarios = data[idx_data-2]
    if return_metadata:
        mt["var_us"] = [r for r in data[idx_data-1] if r]
        mt["var_scs"] = [r for r in data[idx_data-2] if r and (r.lower() != 'time')]
        mt["var_nm"] = [r.split('-')[0] for r in data[idx_data-3] if r]
        try:
            # In this case returns the unique long name
            idx_long = [c for c in range(len(data)) if 'Climate variables long name:' in data[c]][0]
            mt["var_l_nm"] = data[idx_long][1].split(',')
        except (IndexError, TypeError):
            pass
    # Get metrics data
    if [d for d in data if (len(d) > 0) and\
                           ('Metric variables:' in d)][0][1] != '':
        # Two rows below to get the metric values
        # Five rows up to get to the last metric
        idx_metrics = [c for c in range(len(data))\
                       if 'Metrics requested:' in data[c]][0] + 2
        all_metrics = data[idx_metrics:idx_data - 5]
    else:
        all_metrics = []
    dt_metrics = {}
    for r in all_metrics:
        metric_nm = r[0].split(' - ')[0]
        var_nm = r[0].split(' - ')[1]
        if var_nm not in dt_metrics:
            dt_metrics.update({var_nm:{}})
        dt_metrics[var_nm].update({metric_nm:{}})
        for c in range(1, len(r)):
            if _isfloat(r[c]):
                dt_metrics[var_nm][metric_nm][all_data_scenarios[c]] = float(r[c])
   
    # Get climate payload data
    dt = {}
    for c in range(len(all_data_vars)):
        if all_data_vars[c] != '':
            dt.setdefault(all_data_vars[c],{}).update({all_data_scenarios[c]:[]})
   
    # Load all data to local variables
    time = []
    for r in all_data:
        for c in range(len(r)):
            if ("SSP" in all_data_scenarios[c]) or ("ERA" in all_data_scenarios[c]):
                dt[all_data_vars[c]][all_data_scenarios[c]].append(float(r[c]) if r[c]!='' else float('NaN'))
        time.append(datetime.strptime(r[0], '%Y/%m' if 'monthly' in dts_type.lower()\
                                                    else "%Y/%m/%d"))
    # Payload data to array
    time = np.array(time, dtype = f'datetime64[{"M" if "monthly" in dts_type.lower() else "D"}]')
    dt = {k: {s: np.asarray(dt[k][s]) for s in dt[k]} for k in dt}
    if return_pandas_df:
        all_dfs = [pd.Series({d: v for d, v in zip(time, dt[var][s])}).to_frame(name = f"{var} [{s}]")\
                   for var in dt for s in dt[var]]
        df_vals = all_dfs[0].join(all_dfs[1:])
        # Return empty DF if no metrics
        all_dfs = [pd.Series({v: dt_metrics[var][s][v] for v in dt_metrics[var][s]}).to_frame(name = f"{var} [{s}]")\
                   for var in dt_metrics for s in dt_metrics[var]]
        # Avoid empty metrics if csv does not contain them
        df_metrics = all_dfs[0].join(all_dfs[1:]) if all_dfs else pd.DataFrame()
       
        if return_metadata:
            return df_vals, df_metrics, mt
        else:
            return df_vals, df_metrics
    else:
        if return_metadata:
            return dt, dt_metrics, time, mt
        else:
            return dt, dt_metrics, time
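
A minimal usage sketch (the file name sustax_monthly.csv is hypothetical; replace it with the path to your Sustax CSV):


df_vals, df_metrics, mt = load_sustax_file("sustax_monthly.csv",
                                           return_pandas_df = True,
                                           return_metadata = True)
print(mt["lat"], mt["lon"])   # POI coordinates parsed from the metadata
print(df_vals.head())         # time-indexed data, one column per "variable [scenario]"
print(df_metrics)             # accuracy metrics, one row per scenario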

Import Data with R

The code below requires R 4.4.1 or above:


load_sustax_file <- function(csv_stx, return_metadata = FALSE) {
  #' Load Sustax CSV data into R with multi-level column names
  #'
  #' @param csv_stx Character string. Full file path to Sustax CSV
  #' @param return_metadata Logical. TRUE to include metadata (default: FALSE)
  #' @return List with climate_data, accuracy_metrics, and metadata
  #'   Both data frames have multi-level column names

  is_float <- function(x) {
    !is.na(suppressWarnings(as.numeric(x))) && nchar(trimws(x)) > 0
  }

  con <- file(csv_stx, "r", encoding = "UTF-8")
  lines <- readLines(con, warn = FALSE)
  close(con)

  data <- lapply(lines, function(line) {
    con_line <- textConnection(line)
    parsed <- tryCatch({
      as.character(read.csv(con_line, header = FALSE, stringsAsFactors = FALSE, 
                           check.names = FALSE, na.strings = NULL)[1, ])
    }, error = function(e) strsplit(line, ",")[[1]])
    close(con_line)
    return(parsed)
  })

  # Extract metadata (if requested)
  if (return_metadata) {
    mt <- list()
    for (i in seq_along(data)) {
      r <- data[[i]]
      if (i == 2 && length(r) >= 2) mt$creation_date <- r[2]
      if (any(grepl("longitude", r, ignore.case = TRUE))) {
        nums <- suppressWarnings(as.numeric(r))
        nums <- nums[!is.na(nums)]
        if (length(nums) > 0) mt$lon <- nums[1]
      }
      if (any(grepl("latitude", r, ignore.case = TRUE))) {
        nums <- suppressWarnings(as.numeric(r))
        nums <- nums[!is.na(nums)]
        if (length(nums) > 0) mt$lat <- nums[1]
      }
      if (any(grepl("soil variables", r, ignore.case = TRUE))) {
        idx_loc <- which(sapply(data, function(x) any(grepl("^Soil variables:", x))))[1]
        if (!is.na(idx_loc) && idx_loc < length(data)) {
          soil_header <- data[[idx_loc]]
          soil_values <- data[[idx_loc + 1]]
          mt$soil_data <- list()
          for (c in 2:length(soil_header)) {
            if (c <= length(soil_values) && 
                nchar(trimws(soil_header[c])) > 0 && 
                nchar(trimws(soil_values[c])) > 0) {
              mt$soil_data[[soil_header[c]]] <- soil_values[c]
            }
          }
        }
      }
    }
  }

  dts_type <- ""
  for (d in data) {
    if (length(d) > 0 && d[1] == "Dataset:") {
      dts_type <- d[2]
      break
    }
  }

  idx_data <- which(sapply(data, function(x) any(grepl("^Data requested:", x))))[1] + 4
  all_data_vars <- data[[idx_data - 3]]
  all_data_scenarios <- data[[idx_data - 2]]
  all_data_units <- data[[idx_data - 1]]

  # Store variable metadata
  if (return_metadata) {
    mt$var_us <- all_data_units[nchar(trimws(all_data_units)) > 0]
    mt$var_scs <- all_data_scenarios[nchar(trimws(all_data_scenarios)) > 0 & 
                                     tolower(all_data_scenarios) != "time"]
    mt$var_nm <- sapply(all_data_vars[nchar(trimws(all_data_vars)) > 0], function(x) {
      if (grepl("-", x, fixed = TRUE)) strsplit(x, "-", fixed = TRUE)[[1]][1] else x
    }, USE.NAMES = FALSE)
    idx_long <- which(sapply(data, function(x) 
      length(x) > 0 && any(grepl("Climate variables long name:", x))))[1]
    if (!is.na(idx_long)) {
      long_str <- paste(data[[idx_long]][-1], collapse = ",")
      long_str <- gsub('"', '', long_str)
      mt$var_l_nm <- strsplit(long_str, ",")[[1]]
    }
  }

  # Check for metrics
  metric_var_idx <- which(sapply(data, function(x) 
    length(x) > 0 && any(grepl("^Metric variables:", x))))[1]
  has_metrics <- FALSE
  if (!is.na(metric_var_idx)) {
    metric_check <- data[[metric_var_idx]]
    if (length(metric_check) > 1 && nchar(trimws(metric_check[2])) > 0) {
      has_metrics <- TRUE
    }
  }

  # Parse METRICS with multi-level headers
  dt_metrics <- list()
  metrics_col_info <- list()

  if (has_metrics) {
    idx_metrics <- which(sapply(data, function(x) 
      any(grepl("^Metrics requested:", x))))[1] + 2
    if (!is.na(idx_metrics) && idx_metrics < idx_data - 5) {
      all_metrics <- data[idx_metrics:(idx_data - 5)]

      for (r in all_metrics) {
        if (length(r) > 0 && nchar(trimws(r[1])) > 0 && grepl(" - ", r[1], fixed = TRUE)) {
          parts <- strsplit(r[1], " - ", fixed = TRUE)[[1]]
          metric_nm <- parts[1]
          var_nm <- parts[2]

          key <- paste0(var_nm, "_", metric_nm)
          dt_metrics[[key]] <- list()
          metrics_col_info[[key]] <- list(variable = var_nm, metric = metric_nm)

          for (c in 2:length(r)) {
            if (c <= length(all_data_scenarios) && is_float(r[c])) {
              scenario <- all_data_scenarios[c]
              if (nchar(trimws(scenario)) > 0) {
                dt_metrics[[key]][[scenario]] <- as.numeric(r[c])
              }
            }
          }
        }
      }
    }
  }

  # Parse CLIMATE data
  all_data <- data[idx_data:length(data)]
  dt <- list()
  col_info <- list()

  for (c in seq_along(all_data_vars)) {
    var_name <- all_data_vars[c]
    scenario <- all_data_scenarios[c]

    if (nchar(trimws(var_name)) > 0 && 
        nchar(trimws(scenario)) > 0 &&
        (grepl("SSP", scenario) || grepl("ERA", scenario))) {

      key <- paste0(var_name, "_", scenario)
      dt[[key]] <- numeric(0)
      col_info[[key]] <- list(variable = var_name, scenario = scenario)
    }
  }

  # Extract timestamps and values
  time <- character(0)
  is_monthly <- grepl("monthly", dts_type, ignore.case = TRUE)

  for (r in all_data) {
    if (length(r) > 0 && nchar(trimws(r[1])) > 0) {
      time <- c(time, r[1])

      for (c in 2:length(r)) {
        if (c <= length(all_data_scenarios) && c <= length(all_data_vars)) {
          var_name <- all_data_vars[c]
          scenario <- all_data_scenarios[c]

          if (nchar(trimws(var_name)) > 0 && 
              nchar(trimws(scenario)) > 0 &&
              (grepl("SSP", scenario) || grepl("ERA", scenario))) {

            key <- paste0(var_name, "_", scenario)
            value <- ifelse(nchar(trimws(r[c])) > 0, 
                          suppressWarnings(as.numeric(r[c])), 
                          NA_real_)

            if (key %in% names(dt)) {
              dt[[key]] <- c(dt[[key]], value)
            }
          }
        }
      }
    }
  }

  # Convert timestamps
  if (is_monthly) {
    time_parsed <- as.Date(paste0(time, "/01"), format = "%Y/%m/%d")
  } else {
    time_parsed <- as.Date(time, format = "%Y/%m/%d")
  }

  # Build CLIMATE dataframe with multi-level headers
  if (length(dt) > 0) {
    df_list <- list(Time = time_parsed)

    for (key in names(dt)) {
      full_var <- col_info[[key]]$variable
      if (grepl("-", full_var, fixed = TRUE)) {
        short_var <- strsplit(full_var, "-", fixed = TRUE)[[1]][1]
      } else {
        short_var <- full_var
      }
      scenario <- col_info[[key]]$scenario

      col_name <- paste0(short_var, "\n", scenario)
      df_list[[col_name]] <- dt[[key]]
    }

    df_vals <- as.data.frame(df_list, stringsAsFactors = FALSE)

    var_names <- c("Time", sapply(names(dt), function(key) {
      full_var <- col_info[[key]]$variable
      if (grepl("-", full_var, fixed = TRUE)) {
        strsplit(full_var, "-", fixed = TRUE)[[1]][1]
      } else {
        full_var
      }
    }))
    scenario_names <- c("", sapply(names(dt), function(key) col_info[[key]]$scenario))

    attr(df_vals, "variable_names") <- var_names
    attr(df_vals, "scenario_names") <- scenario_names

  } else {
    df_vals <- data.frame(Time = time_parsed)
  }

  # Build METRICS dataframe with multi-level headers
  if (length(dt_metrics) > 0) {
    all_scenarios <- unique(unlist(lapply(dt_metrics, names)))
    metrics_df_list <- list()

    for (key in names(dt_metrics)) {
      var_nm <- metrics_col_info[[key]]$variable
      metric_nm <- metrics_col_info[[key]]$metric

      col_name <- paste0(var_nm, "\n", metric_nm)

      values <- sapply(all_scenarios, function(sc) {
        if (sc %in% names(dt_metrics[[key]])) {
          dt_metrics[[key]][[sc]]
        } else {
          NA_real_
        }
      })

      metrics_df_list[[col_name]] <- values
    }

    df_metrics <- as.data.frame(metrics_df_list, stringsAsFactors = FALSE)
    rownames(df_metrics) <- all_scenarios

    var_names_metrics <- sapply(names(dt_metrics), function(key) metrics_col_info[[key]]$variable)
    metric_names_metrics <- sapply(names(dt_metrics), function(key) metrics_col_info[[key]]$metric)

    attr(df_metrics, "variable_names") <- var_names_metrics
    attr(df_metrics, "metric_names") <- metric_names_metrics

  } else {
    df_metrics <- data.frame()
  }

  if (return_metadata) {
    return(list(climate_data = df_vals, accuracy_metrics = df_metrics, metadata = mt))
  } else {
    return(list(climate_data = df_vals, accuracy_metrics = df_metrics))
  }
}

Importing and Interpreting with Large Language Models

Sustax climate data is delivered in structured CSV files containing metadata, accuracy metrics, and time-series data. Interpreting these files effectively with large language models (LLMs) requires not only importing them but also understanding their structure and content.

LLMs excel at automating the interpretation of diverse data sections by generating parsing and cleaning code and providing initial analyses or summaries. This greatly reduces manual data preparation and error risks, making climate risk assessment more efficient.

To facilitate meaningful interpretation:

  1. Provide the LLM with context by granting access to the Sustax Documentation Hub before processing data.
  2. There are two alternative ways for the LLM to interpret the CSV file:
    • For textual context: Split CSV files into manageable parts: (1) metadata as plain text to preserve descriptive context (i.e., metadata.txt), and (2) accuracy metrics and (3) time-series sections as tabular CSVs (i.e., accuracy_metrics.csv, climate_timeseries_data.csv) to maintain structure. A splitting sketch is shown right after this list.
    • For analytical context: Copy the official loader function from Sustax's Documentation Hub (right above) and paste it directly into the LLM prompt as text; the generative AI system will interpret it and gain immediate access to clean data structures for aggregation, plotting, or model evaluation. This way you do not need to split the original Sustax CSV.
  3. Test the LLM's understanding of the data by asking questions like "Where is this data from?" or "What period of data is included in the time series?"
  4. Use summarised or monthly data extracts instead of more extensive daily data when possible, as LLMs process text via prompts or integrated tools rather than directly ingesting files.
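
A minimal splitting sketch in Python for the textual-context option, assuming the marker rows 'Metrics requested:' and 'Data requested:' used by the loader above (the input file name is hypothetical):


# Split a Sustax CSV into the three parts suggested above
with open("sustax_monthly.csv", "r", encoding = "utf-8-sig") as fobj:
    lines = fobj.read().splitlines()

# Marker rows taken from the loader function; adjust if your file differs
idx_metrics = next(i for i, l in enumerate(lines) if l.startswith("Metrics requested:"))
idx_data = next(i for i, l in enumerate(lines) if l.startswith("Data requested:"))

with open("metadata.txt", "w", encoding = "utf-8") as f:
    f.write("\n".join(lines[:idx_metrics]))           # descriptive metadata block
with open("accuracy_metrics.csv", "w", encoding = "utf-8") as f:
    f.write("\n".join(lines[idx_metrics:idx_data]))   # timeless metrics block
with open("climate_timeseries_data.csv", "w", encoding = "utf-8") as f:
    f.write("\n".join(lines[idx_data:]))              # time series block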


(Figure) Interpretation of a Sustax file by an LLM (GPT-5 Thinking): MxMT projection chart of scenario-based temperature variations

Best practices

  • Prompt Engineering: Always include context, e.g., "Here is accuracy_metrics.csv content: [paste or upload]. Generate Python code to load and visualize it."
  • Tools Integration: Use LLMs with file-handling plugins (e.g., ChatGPT Advanced Data Analysis) to upload CSVs directly.
  • Testing: Validate that splits cause no data loss (e.g., row counts match). If using APIs, base64-encode files for transmission.
  • Security Note: Avoid sharing sensitive data; anonymize if needed.

(Coming soon) Sustax User's API

The Sustax User's API (SUA) is a programmatic interface designed for accessing and downloading climate data from the Sustax platform from the command line. It enables users to retrieve any Sustax dataset using point-of-interest (POI) coordinates in the WGS84 latitude-longitude system. This API is particularly useful for developers, researchers, and organisations needing automated, bulk access to Sustax's models for applications like risk assessment, adaptation planning, or integration into custom tools.

The API operates on a token-based authentication system and focuses on POI-based queries. Note that the API automates data requests without price validation, generating downloadable URLs for the requested datasets. 

Step 1.  Authentication: Obtain a User Token

This token authenticates subsequent requests and ensures secure access to your account's data entitlements.

  • Input: Provide your username and password in a POST request.
  • Output: If successful, the API returns a response code of "200" along with a user_token (e.g., { "response": "200", "user_token": "your_token_here" }).
  • Endpoint: https://app.sustax.earth/api
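
A minimal sketch of Step 1 with Python's requests library; the payload field names username and password are assumptions, as the exact request schema is not shown here:


import requests

# Hypothetical field names for the credentials payload
resp = requests.post("https://app.sustax.earth/api",
                     data = {"username": "your_user", "password": "your_password"})
resp.raise_for_status()
payload = resp.json()
if payload.get("response") == "200":
    user_token = payload["user_token"]   # keep this token for Step 2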

Step 2. Submit Data Request

Before making a data request, users must obtain a dataset ID (e.g., for Shared Socioeconomic Pathways like SSP1-1.9 or SSP5-8.5). These Sustax codes are available right below (Sustax’s datasets tables).

No manual "pre-request" validation is needed; the system proceeds automatically to generate a request. Be careful with the pricing!

  • Input: Provide the following parameters
    • usr_tkn: Your authenticated user token from Step 1.
    • disclaimer_acceptance: Pass 1 to accept the disclaimer.
    • lat: Latitude coordinate (WGS84 format, e.g., -19.828707 for a location in Mozambique).
    • lon: Longitude coordinate (WGS84 format, e.g., 34.841782).
    • first_year: Starting year for the data range (e.g., 2000).
    • last_year: Ending year for the data range (e.g., 2080).
    • Sustax code: The Sustax dataset ID(s) for the requested SSP-RCP combination(s) and variable(s).
  • Output: If successful, returns a response code of "200" with a request_url (a direct link to download the data, such as a CSV file containing climate variables, metrics, and projections) and a request_message (any status notes).
  • Endpoint: https://app.sustax.earth/api (GET request)
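
A minimal sketch of Step 2, again with requests; the parameter name sustax_code for the dataset ID is an assumption (the docs call it "Sustax code"), and the coordinates and years are the example values above:


import requests

params = {
    "usr_tkn": user_token,            # token obtained in Step 1
    "disclaimer_acceptance": 1,       # accept the disclaimer
    "lat": -19.828707,                # WGS84 latitude (Mozambique example)
    "lon": 34.841782,                 # WGS84 longitude
    "first_year": 2000,
    "last_year": 2080,
    "sustax_code": "YOUR_DATASET_ID"  # hypothetical parameter name and value
}
resp = requests.get("https://app.sustax.earth/api", params = params)
resp.raise_for_status()
payload = resp.json()
if payload.get("response") == "200":
    print(payload["request_message"])   # status notes, if any
    data_url = payload["request_url"]   # direct download link to the dataset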

This generates a downloadable dataset tailored to the specified location and time period, including monthly or daily climate projections, accuracy metrics (e.g., Mean Bias Error, Pearson Correlation), and supporting details like units and timestamps.

The data is exported in the Sustax standard CSV format (as seen in sample exports), containing headers for climate variables (e.g., Maximum Mean Temperature, Consecutive No-Rain Days), units, scenarios, and metrics, along with timestamps and geographical details.