Parse JSON
Beta Feature: This feature is currently in beta and restricted to approved users. If you are interested in trying it, please contact support and we can enable it for your account.
Type: parse_json
The parse_json action extracts data from web pages and online PDFs. It uses AI to parse web content from text into a pre-defined data schema and return it as a JSON object.
The action allows you to convert unstructured content such as academic papers, forms, and webpages into JSON objects, which you can use in automations, analysis, or further processing.
This feature currently works for online PDFs and web page text.
Parameters
data_schema_id
string
The id of a data schema you have already saved that you want to transform the content into.
You must provide a data_schema or data_schema_id with your request.
data_schema
json
A JSON object describing the data_schema you want to transform the content into.
You must provide a data_schema or data_schema_id with your request.
instruction
string
A custom instruction, in addition to any detail you have added to the data schema, that you want to include with this particular parse.
model
string
The AI model you wish to use to parse the content into JSON.
Default: gpt-4o-mini
Accepted: ["gpt-4o-mini"]
input_token_cap
int
The maximum number of source input tokens that will be passed to the AI model. This can be used to prevent unnecessary credit usage. If your source input is longer than the token cap, it will be truncated. Default: 1,000,000
selector
string
A selector identifying the element whose content you want to parse. This is useful if you are only interested in the contents of a specific element.
output_type
string
Determines whether the action output is saved to a file (and a URL to it returned) or the parsed JSON object is included directly in the response.
Default: file
Accepted: ["file", "inline"]
max_pages
int
If you are parsing a PDF, you can use this parameter to limit the number of pages passed to the LLM. Default: no limit
See universal parameters.
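Putting these parameters together, a parse_json action might be configured like this (a sketch only: the surrounding request envelope is simplified, and sch_abc123 is a hypothetical schema id):

```json
{
  "type": "parse_json",
  "data_schema_id": "sch_abc123",
  "instruction": "Only extract details for the featured product, not related items.",
  "model": "gpt-4o-mini",
  "input_token_cap": 100000,
  "output_type": "inline",
  "max_pages": 10
}
```

Because data_schema and data_schema_id are alternatives, you would supply an inline data_schema object here instead if you have not saved a schema.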
Defining Data Schemas
A data schema tells the model exactly what JSON structure to produce.
You can define schemas in two ways:
Inline schemas (defined directly inside the action)
Reusable schemas (created via the Schema API and referenced by ID in your requests)
Schema Structure
A schema has:
description
string
Explains what data the schema extracts and provides context to help the AI model understand the extraction goal.
Example: "Extract product details from this e-commerce product page"
fields
array
Each field defines a piece of data to extract from the content. See field properties below.
name
string
This identifies the schema and should clearly indicate what data it extracts.
Example: "ProductInfo", "ArticleMetadata", "ContactForm"
Each field in the fields array has:
description
string
Include details about format, handling of missing values, or special cases.
Example: "Maximum salary in GBP. If only one value is provided, use the same value for both min and max. Return null if not provided."
fields
array
Required only for object and array types.
name
string
Use clear, descriptive names that follow your preferred naming convention (e.g., snake_case or camelCase). Example: "product_name", "published_date", "author_email"
type
string
Determines how the AI interprets and structures the extracted data. Must be one of the supported types below.
Supported Field Types
array
List of items
boolean
True/False
datetime
timestamp
decimal
Precise decimal
double
Floating-point number
integer
Whole number
object
Nested structured object
string
Text value
Inline Schema Example
This example shows:
Simple fields (string, datetime) for basic data
Object fields for grouped related data with nested fields
Array fields for lists of items, with nested fields defining each item's structure
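A sketch of such an inline schema, using the field structure described above (the field names and descriptions are illustrative):

```json
{
  "name": "ArticleMetadata",
  "description": "Extract article metadata from a news page",
  "fields": [
    { "name": "title", "type": "string", "description": "The article headline" },
    { "name": "published_date", "type": "datetime", "description": "Publication timestamp. Return null if not provided." },
    {
      "name": "author",
      "type": "object",
      "description": "Details about the article's author",
      "fields": [
        { "name": "name", "type": "string", "description": "The author's full name" },
        { "name": "email", "type": "string", "description": "The author's email address. Return null if not provided." }
      ]
    },
    {
      "name": "tags",
      "type": "array",
      "description": "Topic tags attached to the article",
      "fields": [
        { "name": "tag", "type": "string", "description": "A single topic tag" }
      ]
    }
  ]
}
```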
Schema Operations
Instead of defining schemas inline every time, you can save them to your Gaffa account and reuse them across multiple requests. This makes your actions more readable and easier to maintain, and ensures consistency when parsing similar content.
Creating a Saved Schema
Use the POST /v1/schemas endpoint to create a reusable schema:
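A request body for this endpoint might look like the following (a sketch; it reuses the field structure described under Schema Structure, and the example values are illustrative):

```json
{
  "name": "JobListing",
  "description": "Extract job listing details from a job advert page",
  "fields": [
    { "name": "job_title", "type": "string", "description": "The advertised job title" },
    { "name": "salary_min", "type": "integer", "description": "Minimum salary in GBP. Return null if not provided." },
    { "name": "salary_max", "type": "integer", "description": "Maximum salary in GBP. If only one value is provided, use the same value for both min and max. Return null if not provided." }
  ]
}
```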
Response:
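An illustrative response, assuming the endpoint returns the saved schema with a generated id (the exact response shape and the sch_abc123 value are assumptions):

```json
{
  "id": "sch_abc123",
  "name": "JobListing",
  "description": "Extract job listing details from a job advert page"
}
```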
Save the id returned in the response; you'll use it to reference the schema in your requests.
Managing Schemas
List all schemas:
Allows you to view all schemas saved to your account:
Endpoint: GET /v1/schemas
Update a schema:
Allows you to modify an existing schema by its ID:
Endpoint: PUT /v1/schemas/:id
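An update request might resend the full schema definition (a sketch; this assumes the same body shape as creation, and the field names are illustrative):

```json
{
  "name": "JobListing",
  "description": "Extract job listing details, including contract type, from a job advert page",
  "fields": [
    { "name": "job_title", "type": "string", "description": "The advertised job title" },
    { "name": "contract_type", "type": "string", "description": "e.g. permanent, contract, part-time. Return null if not stated." }
  ]
}
```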
Delete a schema:
Removes a schema from your account:
Endpoint: DELETE /v1/schemas/:id
Common Schema Patterns
Simple List Extraction
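For example, a schema that pulls a flat list of items from a page might look like this (a sketch; the names are illustrative):

```json
{
  "name": "ProductList",
  "description": "Extract every product shown on this category page",
  "fields": [
    {
      "name": "products",
      "type": "array",
      "description": "All products listed on the page",
      "fields": [
        { "name": "product_name", "type": "string", "description": "The product's display name" },
        { "name": "price", "type": "decimal", "description": "The listed price. Return null if not shown." }
      ]
    }
  ]
}
```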
Nested Objects
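Grouping related values under an object field keeps the output organised. A sketch (the names are illustrative):

```json
{
  "name": "CompanyContact",
  "description": "Extract contact details from a company contact page",
  "fields": [
    { "name": "phone", "type": "string", "description": "The main phone number. Return null if not provided." },
    {
      "name": "address",
      "type": "object",
      "description": "The company's postal address",
      "fields": [
        { "name": "street", "type": "string", "description": "Street name and number" },
        { "name": "city", "type": "string", "description": "City or town" },
        { "name": "postcode", "type": "string", "description": "Postal code. Return null if not provided." }
      ]
    }
  ]
}
```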
Pricing
The number of credits this action uses depends on the model used. Here are the currently supported models and their pricing:
gpt-4o-mini
1 credit per 10,000 input tokens
4 credits per 10,000 output tokens