Parse JSON

Paid Action: This action will consume credits based on the amount of content being parsed, see more below.

Beta Feature: This feature is currently in beta and restricted to approved users. If you're are interested in trying it, please contact support and we can enable this feature for your account.

Type: parse_json

The parse_json action extracts data from web pages and online PDFs. It uses AI to parse web content from text into a pre-defined data schema and return it as a JSON object.

The action allows you to convert unstructured content such as academic papers, forms, and webpages into JSON objects, which you can use in automations, analysis, or further processing.

This feature currently works for online PDFs and web page text.

Parameters

Name

Type

Required

Description

data_schema_id

string

The id of the data schema you have defined that you want to transform the content into. You must provide a data_schema or data_schema_id with your request.

data_schema

json

A JSON object describing the data_schema you want to transform the content into.

You must provide a data_schema or data_schema_id with your request.

instruction

string

A custom instruction, in addition to any detail you have added to the data schema, that you want to include with this particular parse.

model

string`

The AI model you wish to use to parse the content into JSON. Default: gpt-4o-mini Accepted: ["gpt-4o-mini"]

input_token_cap

int

The max number of source input tokens that will be passed to the AI model to parse. This can be used to prevent unnecessary credit usage. If your source input is longer than the token cap, it will be abbreviated. Default: 1,000,000

selector

string

The selector that defines an element you want to parse the content of - this is useful if you are only interested in the contents of a certain element.

output_type

string

Should the action output be saved to a file where a URL will be returned or should the parsed JSON object be included directly in the request. Default: file Accepted: ["file", "inline"]

max_pages

int

If you are parsing a PDF you can specify this parameter to limit the number of pages that are passed to the LLM. Default: no limit

See universal parameters.

Defining Data Schemas

A data schema tells the model exactly what JSON structure to produce.

You can define schemas in two ways:

Inline schemas (defined directly inside the action)
Reusable schemas (created via the Schema API and referenced by ID in your requests)

Schema Structure

A schema has:

Property

Type

Description

description

string

Explains what data the schema extracts and provides context to help the AI model understand the extraction goal. Example: "Extract product details from this e-commerce product page"

fields

array

Each field defines a piece of data to extract from the content. See field properties below.

name

string

This identifies the schema and should clearly indicate what data it extracts. Example: "ProductInfo", "ArticleMetadata", "ContactForm"

Each field in the fields array has:

descripton

string

Include details about format, handling of missing values, or special cases.

Example: "Maximum salary in GBP. If only one value is provided, use the same value for both min and max. Return null if not provided."

fields

array

Required only for object and array types.

name

string

Use clear, descriptive names that follow your preferred naming convention (e.g., snake_case or camelCase). Example: "product_name", "published_date", "author_email"

type

string

Determines how the AI interprets and structures the extracted data. Must be one of the supported types below.

Supported Field Types

Type

Description

array

List of items

boolean

True/False

datetime

timestamp

decimal

Precise decimal

double

Floating-point number

integer

Whole number

object

Nested structured object

string

Text value

Inline Schema Example

{
  "type": "parse_json",
  "data_schema": {
    "name": "ArticleMetadata",
    "description": "Extract metadata from an article",
    "fields": [
      {
        "type": "string",
        "name": "title",
        "description": "Article title"
      },
      {
        "type": "string",
        "name": "author",
        "description": "Author name"
      },
      {
        "type": "datetime",
        "name": "published",
        "description": "Publication date"
      }
    ]
  },
  "model": "gpt-4o-mini",
  "output_type": "inline"
}

This example shows:

Simple fields (string, datetime) for basic data
Object fields for grouped related data with nested fields
Array fields for lists of items with nested fields defining each item's structure

Schema Operations

Instead of defining schemas inline every time, they can be saved to your Gaffa account and be reused across multiple requests. This makes your actions more readable, easier to maintain, and ensures consistency when parsing similar content.

Creating a Saved Schema

Use the POST /v1/schemas endpoint to create a reusable schema:

curl -L \
  --request POST \
  --url 'https://api.gaffa.dev/v1/schemas' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "ProductInfo",
    "description": "Extract product details from e-commerce pages",
    "fields": [
      {
        "type": "string",
        "name": "product_name",
        "description": "The product title"
      },
      {
        "type": "decimal",
        "name": "price",
        "description": "Current price"
      },
      {
        "type": "boolean",
        "name": "in_stock",
        "description": "Product availability"
      },
      {
        "type": "object",
        "name": "ratings",
        "description": "Product rating information",
        "fields": [
          {
            "type": "double",
            "name": "average",
            "description": "Average rating score"
          },
          {
            "type": "integer",
            "name": "total_reviews",
            "description": "Number of reviews"
          }
        ]
      },
      {
        "type": "array",
        "name": "tags",
        "description": "Product tags",
        "fields": [
          {
            "type": "string",
            "name": "tag",
            "description": "Individual tag name"
          }
        ]
      }
    ]
  }'

Response:

{
  "id": "schema_abc123xyz",
  "name": "ProductInfo",
  "description": "Extract product details from e-commerce pages",
  "fields": [...]
}

Save the id returned in the response, you'll use this to reference the schema in your requests

Managing Schemas

List all schemas:

Allows you to view all schemas saved to your account:

Endpoint: GET /v1/schemas

curl -L \
  --url 'https://api.gaffa.dev/v1/schemas' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'Accept: */*'

Update a schema:

Allows you to modify an existing schema by its ID:

Endpoint: PUT /v1/schemas

curl -L \
  --request PUT \
  --url 'https://api.gaffa.dev/v1/schemas/{id}' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "id": "schema_abc123xyz",
    "name": "ProductInfo",
    "description": "Extract detailed product information from e-commerce pages",
    "fields": [
      {
        "type": "string",
        "name": "product_name",
        "description": "The product title"
      },
      {
        "type": "decimal",
        "name": "price",
        "description": "Current price"
      },
      {
        "type": "string",
        "name": "brand",
        "description": "Product brand name"
      }
    ]
  }'

Delete a schema:

Removes a schema from your account:

Endpoint: DELETE /v1/schemas/:id

curl -L \
  --request DELETE \
  --url 'https://api.gaffa.dev/v1/schemas/{id}' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'Accept: */*'

Common Schema Patterns

Simple List Extraction

{
  "name": "TagList",
  "description": "Extract article tags",
  "fields": [
    {
      "type": "array",
      "name": "tags",
      "description": "List of article tags",
      "fields": [
        {
          "type": "string",
          "name": "tag",
          "description": "Individual tag name"
        }
      ]
    }
  ]
}

Nested Objects

{
  "name": "ProductWithReviews",
  "description": "Product details with nested review data",
  "fields": [
    {
      "type": "string",
      "name": "product_name",
      "description": "Product name"
    },
    {
      "type": "object",
      "name": "pricing",
      "description": "Pricing information",
      "fields": [
        {
          "type": "decimal",
          "name": "current_price",
          "description": "Current price"
        },
        {
          "type": "decimal",
          "name": "original_price",
          "description": "Original price before discount"
        },
        {
          "type": "integer",
          "name": "discount_percentage",
          "description": "Discount percentage"
        }
      ]
    }
  ]
}

Pricing

The credits this action uses depends on the model used. Here are the current supported models and their pricing:

Model

Input Token Cost

Output Token Cost

gpt-4o-mini

1 credit per 10,000 input tokens

4 credits per 10,000 output tokens

PreviousPrint NextParse Table

Last updated 1 month ago