Building a Document Analysis Pipeline for Regulatory Filings

Why Regulatory Documents Are Hard

Regulatory submissions — FDA New Drug Applications, EMA Marketing Authorisation Applications, SEC filings — share several characteristics that make manual analysis slow:

Volume: a full NDA can run to 100,000+ pages
Structure: content is distributed across modules with complex cross-references
Implicit claims: risk assessments embed assumptions that are never stated explicitly
Temporal layering: amendments, supplements, and correspondence accumulate over years

Configuring the Pipeline

For regulatory documents, we recommend the following configuration:

{
  "document_type": "regulatory_submission",
  "framework": "ectd",
  "extraction_targets": [
    "safety_claims",
    "efficacy_claims",
    "risk_benefit_statements",
    "open_issues",
    "commitments"
  ]
}
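
Before submitting a configuration like the one above, it can help to check it locally. The following is a minimal sketch in Python, not part of the product: the function name load_config and the validation rules are assumptions, and the set of allowed targets is simply copied from the example configuration rather than taken from a documented schema.

```python
import json

# Assumption: the only recognised targets are the five listed in the
# example configuration; the real service may accept others.
KNOWN_TARGETS = {
    "safety_claims",
    "efficacy_claims",
    "risk_benefit_statements",
    "open_issues",
    "commitments",
}

REQUIRED_KEYS = ("document_type", "framework", "extraction_targets")


def load_config(text: str) -> dict:
    """Parse a configuration string and reject obviously malformed input."""
    config = json.loads(text)
    for key in REQUIRED_KEYS:
        if key not in config:
            raise ValueError(f"missing required key: {key}")
    unknown = set(config["extraction_targets"]) - KNOWN_TARGETS
    if unknown:
        raise ValueError(f"unknown extraction targets: {sorted(unknown)}")
    return config


EXAMPLE = """
{
  "document_type": "regulatory_submission",
  "framework": "ectd",
  "extraction_targets": ["safety_claims", "commitments"]
}
"""

config = load_config(EXAMPLE)
```

A misspelled target such as "saftey_claims" raises a ValueError here instead of being silently ignored downstream.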

Key Outputs

Safety Signal Map

All adverse events reported across clinical modules are aggregated by system organ class, with frequency and severity data extracted and tabulated.
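
As a rough illustration of the aggregation described above, the sketch below groups adverse event records by system organ class and totals them by severity. The record layout (keys "soc", "term", "severity", "count") and the function name build_signal_map are hypothetical, and the sample records are made-up placeholders, not data from any submission.

```python
from collections import defaultdict


def build_signal_map(events):
    """Group adverse event records by system organ class (SOC).

    Each record is assumed to be a dict with the hypothetical keys
    "soc", "term", "severity" and "count".
    """
    totals = defaultdict(int)
    by_severity = defaultdict(lambda: defaultdict(int))
    for event in events:
        totals[event["soc"]] += event["count"]
        by_severity[event["soc"]][event["severity"]] += event["count"]
    # Convert to plain dicts so the result serialises cleanly to JSON.
    return {
        soc: {"total": totals[soc], "by_severity": dict(by_severity[soc])}
        for soc in totals
    }


# Placeholder records for illustration only.
sample_events = [
    {"soc": "Cardiac disorders", "term": "Palpitations", "severity": "mild", "count": 4},
    {"soc": "Cardiac disorders", "term": "Tachycardia", "severity": "moderate", "count": 2},
    {"soc": "Nervous system disorders", "term": "Headache", "severity": "mild", "count": 7},
]

signal_map = build_signal_map(sample_events)
```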

Commitment Register

Post-approval commitments are extracted as structured items, each linked to the section of the submission where it appears.
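
One way to picture a register entry is as a small record that carries its own source location. The sketch below is an assumption about shape only: the class name Commitment, its field names, and the helper index_by_section are hypothetical and do not reflect the actual output schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commitment:
    """A single register entry (hypothetical field names)."""
    commitment_id: str
    text: str
    source_section: str  # section of the submission where the commitment appears


def index_by_section(commitments):
    """Build a lookup from source section to the commitments found there."""
    index = {}
    for commitment in commitments:
        index.setdefault(commitment.source_section, []).append(commitment)
    return index


# Placeholder entries for illustration only.
register = [
    Commitment("C-001", "Submit final study report for the long-term extension study.", "1.2"),
    Commitment("C-002", "Provide updated stability data at 36 months.", "3.2.P.8"),
    Commitment("C-003", "Report interim pharmacovigilance findings annually.", "1.2"),
]

section_index = index_by_section(register)
```

Keeping the section reference on each item means a reviewer can go from a register entry back to its origin without searching the dossier.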

Cross-Module Consistency Check

assay.it flags claims that appear in one module but are contradicted or unsubstantiated in another — a common source of regulatory queries.
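
The check described above can be sketched in a much simplified form. The code below is not the method assay.it uses. It assumes claims have already been normalised into records with the hypothetical keys "module", "subject" and "value", treats differing values for the same subject as a contradiction, and treats a subject that appears in only one module as unsubstantiated.

```python
def find_inconsistencies(claims):
    """Flag subjects whose claims disagree or appear in only one module.

    Each claim is assumed to be a dict with the hypothetical keys
    "module", "subject" and "value".
    """
    by_subject = {}
    for claim in claims:
        by_subject.setdefault(claim["subject"], []).append(claim)

    flagged = []
    for subject, group in by_subject.items():
        values = {claim["value"] for claim in group}
        modules = sorted({claim["module"] for claim in group})
        if len(values) > 1:
            issue = "contradicted"
        elif len(modules) == 1:
            issue = "unsubstantiated"
        else:
            continue  # same value reported in more than one module
        flagged.append({"subject": subject, "issue": issue, "modules": modules})
    return flagged


# Placeholder claims for illustration only.
sample_claims = [
    {"module": "2.5", "subject": "primary_endpoint_met", "value": True},
    {"module": "5.3.5", "subject": "primary_endpoint_met", "value": False},
    {"module": "2.7.4", "subject": "no_hepatic_signal", "value": True},
]

issues = find_inconsistencies(sample_claims)
```

In practice the hard part is the normalisation step that decides two passages are about the same subject; this sketch takes that step as given.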

Integration

Output is delivered in structured JSON-LD and can be imported directly into regulatory information management systems or linked to dossier management platforms via our API.
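
To give a sense of what consuming JSON-LD output might look like, the sketch below builds a single item, serialises it, and reads it back using only the Python standard library. The vocabulary URL, the "Commitment" type, the identifier scheme and the property names are all placeholders chosen for illustration; the actual context and schema are not specified here.

```python
import json

# Hypothetical vocabulary; the real @context is not documented in this section.
CONTEXT = {"@vocab": "https://example.org/regulatory#"}


def to_jsonld(commitment_id, text, source_section):
    """Wrap one extracted item as a JSON-LD node (illustrative shape only)."""
    return {
        "@context": CONTEXT,
        "@type": "Commitment",
        "@id": f"urn:example:commitment:{commitment_id}",
        "text": text,
        "sourceSection": source_section,
    }


def read_items(serialised, item_type):
    """Return the nodes of a given @type from a serialised list of nodes."""
    nodes = json.loads(serialised)
    return [node for node in nodes if node.get("@type") == item_type]


# Placeholder item for illustration only.
payload = json.dumps(
    [to_jsonld("C-001", "Submit final study report.", "1.2")],
    indent=2,
)
commitments = read_items(payload, "Commitment")
```

Because JSON-LD is ordinary JSON with a few reserved keys, a receiving system can start with plain JSON parsing as shown and adopt a full JSON-LD processor later if it needs context expansion.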