
Parse PDF to Structured JSON

An example request that uses Gaffa to extract structured data from an online PDF.

The following example is a request we've pre-built to show you Gaffa's capabilities against our demo site. You can run this request in the Gaffa API Playground.

This example demonstrates how to extract structured data from PDF documents. Gaffa downloads the PDF and uses an AI model to parse its content according to the schema you supply, which suits use cases such as research databases, citation managers, or literature review tools.

This feature currently works for online PDFs.

API Request

The request below uses the POST endpoint to download a demo research paper from the hosted PDFs, wait for it to load, and then parse the first page to extract author information and paper metadata.

{
  "url": "https://demo.gaffa.dev/simulate/pdf/ReasoningAboutActionAndChange.pdf",
  "proxy_location": null,
  "async": false,
  "max_cache_age": 0,
  "settings": {
    "record_request": false,
    "actions": [
      {
        "type": "download_file"
      },
      {
        "type": "parse_json",
        "data_schema": {
          "name": "AcademicPaper",
          "description": "Schema for parsing academic paper summary and author information",
          "fields": [
            {
              "type": "string",
              "name": "title",
              "description": "The full title of the academic paper"
            },
            {
              "type": "string",
              "name": "abstract",
              "description": "The paper's abstract or summary"
            },
            {
              "type": "array",
              "name": "authors",
              "description": "List of authors who contributed to the paper",
              "fields": [
                {
                  "type": "string",
                  "name": "name",
                  "description": "Author's full name as it appears in the paper"
                },
                {
                  "type": "array",
                  "name": "affiliations",
                  "description": "Institutional affiliations for this author",
                  "fields": [
                    {
                      "type": "string",
                      "name": "institution",
                      "description": "Name of the university or research institution"
                    },
                    {
                      "type": "string",
                      "name": "department",
                      "description": "Department or division name"
                    },
                    {
                      "type": "string",
                      "name": "city",
                      "description": "City where the institution is located"
                    },
                    {
                      "type": "string",
                      "name": "country",
                      "description": "Country of the institution"
                    }
                  ]
                },
                {
                  "type": "string",
                  "name": "email",
                  "description": "Author's contact email address if provided"
                }
              ]
            },
            {
              "type": "array",
              "name": "keywords",
              "description": "Key terms and topics covered in the paper",
              "fields": [
                {
                  "type": "string",
                  "name": "keyword",
                  "description": "Individual keyword or phrase"
                }
              ]
            }
          ]
        },
        "instruction": "Parse this academic paper focusing on the title, abstract, author information, and keywords typically found on the first page. Extract all author names, their institutional affiliations with department and location details, and their contact information.",
        "model": "gpt-4o-mini",
        "output_type": "inline",
        "max_pages": 1
      }
    ]
  }
}
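As a rough illustration, the request body above could be assembled and posted from Python along the following lines. This is a minimal sketch, not an official client: the endpoint URL and the API key header name are assumptions (neither appears on this page), and the helper names `build_payload`, `build_request`, and `send` are hypothetical. Check the API reference for the real values before using it.

```python
import json
import urllib.request

# Assumed values: neither the endpoint URL nor the auth header name
# is given on this page, so confirm both against the API reference.
GAFFA_ENDPOINT = "https://api.gaffa.dev/v1/browser/requests"  # assumption
API_KEY_HEADER = "X-API-Key"  # assumption


def build_payload(pdf_url, data_schema, instruction, max_pages=1):
    """Build a request body with the same shape as the JSON example."""
    return {
        "url": pdf_url,
        "proxy_location": None,
        "async": False,
        "max_cache_age": 0,
        "settings": {
            "record_request": False,
            "actions": [
                # Step 1: download the PDF at `url`.
                {"type": "download_file"},
                # Step 2: parse the downloaded file against the schema.
                {
                    "type": "parse_json",
                    "data_schema": data_schema,
                    "instruction": instruction,
                    "model": "gpt-4o-mini",
                    "output_type": "inline",
                    "max_pages": max_pages,
                },
            ],
        },
    }


def build_request(payload, api_key):
    """Wrap the payload in a POST request without sending it."""
    return urllib.request.Request(
        GAFFA_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", API_KEY_HEADER: api_key},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping payload construction separate from sending makes it easy to inspect or log the body before any network call is made, and to reuse the same schema across several PDF URLs.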

Actions

Response

The parsed data is returned as a structured JSON object matching your schema.
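Because the model fills in the schema on a best-effort basis, it can be useful to check which fields actually came back. The sketch below walks the `fields` list from the request's `data_schema` and reports anything absent from a parsed object. It is only an illustration: the helper `missing_fields` is hypothetical, it assumes array items are returned as objects keyed by the nested field names, and it does not assume anything about the envelope the API wraps around the parsed data.

```python
def missing_fields(schema_fields, value, path=""):
    """Return the paths of schema fields that are absent from `value`.

    `schema_fields` is a `fields` list as used in `data_schema`;
    `value` is the parsed object (or a nested item) to check.
    """
    missing = []
    for field in schema_fields:
        name = field["name"]
        field_path = f"{path}.{name}" if path else name
        if not isinstance(value, dict) or name not in value:
            missing.append(field_path)
            continue
        # Recurse into array fields that declare nested fields.
        if field["type"] == "array" and "fields" in field:
            for index, item in enumerate(value[name] or []):
                missing += missing_fields(
                    field["fields"], item, f"{field_path}[{index}]"
                )
    return missing
```

For example, calling `missing_fields(data_schema["fields"], parsed)` on a result where one author has no `email` key would report a path such as `authors[0].email`, which can then be logged or used to trigger a retry with a larger `max_pages`.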
