CSV Format Optimization

Redshift Spectra uses the get_statement_result_v2 API with CSV format for significantly faster result parsing.

The Problem

The default get_statement_result API returns typed values in a nested structure:

{
  "Records": [
    [
      {"stringValue": "Product A"},
      {"longValue": 100},
      {"doubleValue": 29.99}
    ],
    [
      {"stringValue": "Product B"},
      {"longValue": 200},
      {"doubleValue": 49.99}
    ]
  ],
  "ColumnMetadata": [
    {"name": "product", "typeName": "varchar"},
    {"name": "quantity", "typeName": "int4"},
    {"name": "price", "typeName": "float8"}
  ]
}

Problems:

  • Complex nested structure
  • Type detection overhead
  • Memory-intensive parsing
  • CPU-bound for large results
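The overhead is easier to see in code. A minimal sketch of what flattening the typed structure involves (`parse_typed_records` is a hypothetical helper written for illustration; the cell and field names follow the response shown above):

```python
def parse_typed_records(response: dict) -> list[dict]:
    """Flatten the typed Records structure into plain dicts.

    Each cell is a one-key dict like {"stringValue": ...}, so the key
    must be inspected per cell just to recover the value.
    """
    columns = [col["name"] for col in response.get("ColumnMetadata", [])]
    rows = []
    for record in response.get("Records", []):
        values = []
        for cell in record:
            if cell.get("isNull"):
                values.append(None)
            else:
                # Take whichever typed field is present in this cell
                values.append(next(iter(cell.values())))
        rows.append(dict(zip(columns, values)))
    return rows


sample = {
    "Records": [
        [{"stringValue": "Product A"}, {"longValue": 100}, {"doubleValue": 29.99}],
    ],
    "ColumnMetadata": [
        {"name": "product", "typeName": "varchar"},
        {"name": "quantity", "typeName": "int4"},
        {"name": "price", "typeName": "float8"},
    ],
}
print(parse_typed_records(sample))
```

Every cell costs a dict lookup and a branch, which is what makes this path CPU-bound at scale.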

The Solution

CSV format returns a flat string that's much faster to parse:

{
  "FormattedRecords": "Product A,100,29.99\nProduct B,200,49.99\n",
  "ColumnMetadata": [
    {"name": "product", "typeName": "varchar"},
    {"name": "quantity", "typeName": "int4"},
    {"name": "price", "typeName": "float8"}
  ]
}

How It Works

flowchart LR
    subgraph TypedFormat["Typed Format (Legacy)"]
        T1[Nested Objects] --> T2[Type Detection]
        T2 --> T3[Complex Parsing]
        T3 --> T4[Result Dict]
    end

    subgraph CSVFormat["CSV Format (Optimized)"]
        C1[Flat String] --> C2[CSV Parser]
        C2 --> C3[Result Dict]
    end

Implementation

Fetching Results with CSV

# services/redshift.py

def get_statement_result(
    self,
    statement_id: str,
    next_token: str | None = None,
    use_csv_format: bool = True,
) -> dict[str, Any]:
    """Get results with CSV format optimization."""

    params = {"Id": statement_id}
    if next_token:
        params["NextToken"] = next_token

    if use_csv_format:
        # Use v2 API with CSV format
        response = self.client.get_statement_result_v2(
            **params,
            Format="CSV",
        )
        return self._parse_csv_result(response)
    else:
        # Fallback to typed format
        response = self.client.get_statement_result(**params)
        return self._parse_typed_result(response)

CSV Parsing

def _parse_csv_result(self, response: dict) -> dict:
    """Parse CSV formatted response."""
    import csv
    import io

    # Extract column names
    columns = [col["name"] for col in response.get("ColumnMetadata", [])]

    # Parse CSV data
    records = []
    csv_data = response.get("FormattedRecords", "")

    if csv_data:
        reader = csv.reader(io.StringIO(csv_data))
        for row in reader:
            # Skip malformed rows whose column count doesn't match the metadata
            if len(row) == len(columns):
                records.append(dict(zip(columns, row)))

    return {
        "columns": columns,
        "records": records,
        "total_rows": response.get("TotalNumRows", len(records)),
        "next_token": response.get("NextToken"),
        "format": "CSV",
    }
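The same parsing logic as a standalone function, so it can be exercised against a response-shaped dict without any AWS call (`parse_csv_result` here is an illustrative copy, not the service method itself):

```python
import csv
import io


def parse_csv_result(response: dict) -> dict:
    """Parse a CSV-formatted response dict into columns and records."""
    columns = [col["name"] for col in response.get("ColumnMetadata", [])]
    records = []
    csv_data = response.get("FormattedRecords", "")
    if csv_data:
        for row in csv.reader(io.StringIO(csv_data)):
            if len(row) == len(columns):
                records.append(dict(zip(columns, row)))
    return {"columns": columns, "records": records}


response = {
    "FormattedRecords": "Product A,100,29.99\nProduct B,200,49.99\n",
    "ColumnMetadata": [
        {"name": "product", "typeName": "varchar"},
        {"name": "quantity", "typeName": "int4"},
        {"name": "price", "typeName": "float8"},
    ],
}
result = parse_csv_result(response)
print(result["records"][0])
```

Note that every value comes back as a string ("100", not 100) — type conversion is covered below.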

Automatic Fallback

If CSV format is not supported, Spectra falls back automatically:

from botocore.exceptions import ClientError

try:
    response = self.client.get_statement_result_v2(
        Id=statement_id,
        Format="CSV",
    )
    return self._parse_csv_result(response)

except ClientError as e:
    if use_csv_format and e.response.get("Error", {}).get("Code") == "ValidationException":
        # CSV not supported, fallback to typed
        logger.info("CSV format not supported, using typed format")
        return self.get_statement_result(
            statement_id=statement_id,
            next_token=next_token,
            use_csv_format=False,
        )
    raise

Performance Comparison

Parsing Time by Result Size

Result Size       Typed Format   CSV Format   Improvement
1,000 rows        15ms           5ms          66%
10,000 rows       120ms          35ms         71%
100,000 rows      1,200ms        300ms        75%
1,000,000 rows    12,000ms       2,800ms      77%
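These numbers are machine-dependent, but the relative gap can be reproduced locally with a rough micro-benchmark that builds both response shapes for the same rows and times each parse path:

```python
import csv
import io
import time

N = 10_000
typed = {"Records": [[{"stringValue": f"p{i}"}, {"longValue": i}] for i in range(N)]}
flat = {"FormattedRecords": "".join(f"p{i},{i}\n" for i in range(N))}
columns = ["product", "quantity"]

# Typed path: per-cell dict inspection
start = time.perf_counter()
typed_rows = [
    dict(zip(columns, (next(iter(cell.values())) for cell in rec)))
    for rec in typed["Records"]
]
typed_time = time.perf_counter() - start

# CSV path: one flat string through the csv module
start = time.perf_counter()
csv_rows = [
    dict(zip(columns, row))
    for row in csv.reader(io.StringIO(flat["FormattedRecords"]))
]
csv_time = time.perf_counter() - start

print(f"typed: {typed_time * 1000:.1f}ms  csv: {csv_time * 1000:.1f}ms")
```

Exact timings will vary; the point is the relative difference between the two paths.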

Memory Usage

Result Size     Typed Format   CSV Format   Reduction
10,000 rows     45MB           12MB         73%
100,000 rows    450MB          120MB        73%

Type Handling

CSV format returns all values as strings. Type conversion is handled at the application layer:

def _convert_types(self, records: list[dict], metadata: list[dict]) -> list[dict]:
    """Convert string values to appropriate types."""

    type_map = {col["name"]: col["typeName"] for col in metadata}

    converted = []
    for record in records:
        row = {}
        for key, value in record.items():
            if value == "" or value is None:
                row[key] = None
            elif type_map.get(key) in ("int2", "int4", "int8"):
                row[key] = int(value)
            elif type_map.get(key) in ("float4", "float8", "numeric"):
                # float() loses precision for numeric/decimal columns; use
                # decimal.Decimal here if exact values are required
                row[key] = float(value)
            elif type_map.get(key) == "bool":
                row[key] = value.lower() in ("t", "true", "1")
            else:
                row[key] = value
        converted.append(row)

    return converted
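Applied to rows parsed from CSV, using a standalone copy of the conversion logic for illustration:

```python
def convert_types(records: list[dict], metadata: list[dict]) -> list[dict]:
    """Convert string values to types named in the column metadata."""
    type_map = {col["name"]: col["typeName"] for col in metadata}
    converted = []
    for record in records:
        row = {}
        for key, value in record.items():
            type_name = type_map.get(key)
            if value == "" or value is None:
                row[key] = None
            elif type_name in ("int2", "int4", "int8"):
                row[key] = int(value)
            elif type_name in ("float4", "float8", "numeric"):
                row[key] = float(value)
            elif type_name == "bool":
                row[key] = value.lower() in ("t", "true", "1")
            else:
                row[key] = value
        converted.append(row)
    return converted


metadata = [
    {"name": "product", "typeName": "varchar"},
    {"name": "quantity", "typeName": "int4"},
    {"name": "price", "typeName": "float8"},
]
rows = convert_types(
    [{"product": "Product A", "quantity": "100", "price": "29.99"}], metadata
)
print(rows)
```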

Limitations

String Representation

All values are returned as strings. Complex types (arrays, JSON) may need additional parsing.

NULL Handling

NULL values appear as empty strings in CSV format. Ensure your application handles this correctly.
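Note the ambiguity this creates: a NULL and a genuinely empty varchar both arrive as "" and cannot be distinguished after parsing. If that distinction matters for a column, keep the typed format for those queries. A quick illustration:

```python
import csv
import io

# Row where quantity is NULL and the last column is an empty string:
# both cells arrive as the same "" in CSV format
row = next(csv.reader(io.StringIO("Product A,,\n")))
print(row)
```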

API Availability

get_statement_result_v2 requires a recent version of boto3. Spectra includes an automatic fallback for compatibility.

Configuration

CSV format is enabled by default. To disable:

# Force typed format for specific queries
result = redshift_service.get_statement_result(
    statement_id=stmt_id,
    use_csv_format=False  # Use typed format
)

Best Practices

Use CSV for Large Results

CSV format provides the most benefit for results with 1,000+ rows.

Consider Type Requirements

If you need precise numeric types (e.g., Decimal for financial data), consider using typed format or implementing custom type conversion.

Monitor Parse Times

Add metrics to track parsing duration and identify optimization opportunities.