FR-S3 · Bulk delivery · v1

Every filing,
in your data lake.

The full FinancialReports corpus, partitioned and delivered to S3. Query G20 markets and Europe with DuckDB, Athena, or Spark — without paying request-by-request.

Bucket: financialreports-bucket
Region: eu-central-1
Refresh: Daily snapshot · incremental manifest
Auth: IAM cross-account · STS
Formats: Parquet · JSONL · MD
Access: Enterprise · per agreement

01 · TL;DR

What S3 bulk delivery is, in three lines.

Same data as the API. Different access pattern. Built for teams running pipelines, training models, or backtesting against the full corpus.

What you get

A read-only IAM role into our bucket.

Parquet, JSONL, and Markdown across filings/, companies/, line_items/, and processed/. Hive-style partitions on year, country, and filing type. Append-only: historical files never change.

Who it's for

Teams that scan more than they fetch.

Quants running cross-sectional analyses, ML teams training on filing text, data warehouses mirroring our corpus. If you're scanning at full-corpus scale, S3 is cheaper than the API.

How it's billed

Flat monthly + your egress.

Subscription includes the data and metadata. AWS bills your account for any cross-region or internet egress. Same-region reads via VPC Gateway Endpoint are zero cost.

02 · API vs S3

Pick by access pattern, not by team preference.

The API is the right answer for almost everything except bulk scans. Use this table to decide — both surfaces share the same identifiers, so you can mix them.

| Use case | REST API | S3 bulk |
|---|---|---|
| Single filing, on demand | Best: One request. Low latency. No object listing. | Possible, but you need to know the key. Higher latency. |
| Powering a webhook-driven app | Best: Webhook fires; fetch the one filing you care about. | Wrong tool. S3 isn't event-driven on the consumer side. |
| Cross-sectional KPI scan | Possible but pages add up. Request count grows quickly across the corpus. | Best: One DuckDB query against line_items/. Seconds. |
| LLM training on filing text | Hammers the rate limit. Not designed for it. | Best: Stream Markdown directly from processed/markdown/. |
| Backtest 10 years of disclosures | Slow and expensive at API rate limits. | Best: Point Spark or Athena at the partitions you need. |
| Watchlist for ~50 companies | Best: Subscribe to filing.processed webhooks. | Overkill for this volume. |
| Mirror our data into your warehouse | Possible. You'd build the sync yourself. | Best: Snapshot on a cron. Diff with the daily manifest. |

Both surfaces share IDs. A filing's id in /filings/{id}/ is the same string used as the filing_id column in Parquet and the directory name under processed/markdown/. Mix freely.
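
As a rough illustration of the shared identifier, the sketch below derives both the API path and the S3 Markdown key from one filing ID. The helper names and the ID `flg_EXAMPLE` are hypothetical, and the sketch assumes you already know the filing's `year` partition (for example from the filings metadata).

```python
BUCKET = "financialreports-bucket"


def api_path(filing_id: str) -> str:
    """REST path for a single filing, as described in the text."""
    return f"/filings/{filing_id}/"


def markdown_key(filing_id: str, year: int) -> str:
    """S3 key of the rendered Markdown for the same filing."""
    return f"processed/markdown/year={year}/filing_id={filing_id}/document.md"


def markdown_uri(filing_id: str, year: int) -> str:
    """Full s3:// URI, convenient for DuckDB, pandas, or the AWS CLI."""
    return f"s3://{BUCKET}/{markdown_key(filing_id, year)}"


# "flg_EXAMPLE" is a made-up ID used only for illustration.
example_id = "flg_EXAMPLE"
print(api_path(example_id))
print(markdown_uri(example_id, 2025))
```

The same string can also be used as a `filing_id` filter against the Parquet tables, so a lookup can start on either surface.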

03 · Quickstart

From zero to your first query.

Five steps. Assumes you have an AWS account and a recent DuckDB installed locally. For Athena and Spark, see §06 · Querying.

1

Send us your AWS account ID

Request access or email [email protected]. We'll attach a bucket policy that grants your IAM role read access to financialreports-bucket.

2

Create an IAM role on your side

A role with s3:GetObject and s3:ListBucket on the bucket. Trust your own account; we don't need to assume it.

3

List the bucket

Confirm access with aws s3 ls s3://financialreports-bucket/. You should see filings/, companies/, line_items/, processed/, manifests/.

4

Run your first query

Use DuckDB's httpfs extension to query Parquet directly. No download step. The example below lists recent annual reports from German companies.

5

Subscribe to the manifest

For incremental sync, watch manifests/incremental/YYYY-MM-DD.parquet. It lists every key written in the last 24 h, so you don't have to re-list the bucket.
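
For step 2, the permissions policy on your role might look roughly like the document generated below. This is a minimal sketch, not the policy we require: the function name and `Sid` values are illustrative, and your platform team may scope the resources more tightly (for example to specific prefixes).

```python
import json

BUCKET = "financialreports-bucket"


def read_only_policy(bucket: str = BUCKET) -> dict:
    """Build an IAM policy document granting list + read on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket applies to the bucket itself.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # GetObject applies to the objects inside the bucket.
                "Sid": "GetObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


policy = read_only_policy()
print(json.dumps(policy, indent=2))
```

Note the two different resource forms: `s3:ListBucket` is granted on the bucket ARN, while `s3:GetObject` is granted on the object ARN pattern ending in `/*`.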

terminal · DuckDB
-- Install once
INSTALL httpfs;
LOAD httpfs;

-- Your AWS credentials (or use SSO / role chain)
SET s3_region = 'eu-central-1';
SET s3_access_key_id = '…';
SET s3_secret_access_key = '…';

-- Latest German annual reports, last 12 months
SELECT
  company_id, company_name, published_at, source
FROM read_parquet(
  's3://financialreports-bucket/filings/year=2025/country=DE/type=annual_report/*.parquet'
)
WHERE published_at >= now() - INTERVAL 12 MONTH
ORDER BY published_at DESC
LIMIT 25;
04 · Bucket layout

Five top-level prefixes, Hive-style partitions.

Every prefix is partitioned by the columns you'd most plausibly filter on. Partition pruning is the difference between scanning 2 TB and scanning 8 GB.

bucket s3://financialreports-bucket/
filings/                        # filing metadata, one row per filing
├── year=2025/
│   ├── country=DE/
│   │   ├── type=annual_report/
│   │   │   └── part-00000.parquet
│   │   └── type=ad_hoc/
│   │       └── part-00000.parquet
│   └── country=FR/ …
├── year=2024/ …
└── year=…/                  # back to corpus start

companies/                      # dimension table — full snapshot daily
├── snapshot=2026-05-06/
│   └── companies.parquet
└── snapshot=2026-05-05/

line_items/                     # L3: extracted KPIs (long format)
├── year=2025/
│   └── country=DE/
│       └── part-00000.parquet
└── …

processed/
├── markdown/                   # AI-ready full text
│   └── year=2025/
│       └── filing_id=flg_01HGZ…/
│           ├── document.md
│           └── sections.jsonl
└── json/                       # clean JSON per filing
    └── year=2025/

manifests/                      # incremental sync helpers
├── filings/                    # full-history filing manifest
└── incremental/
    └── YYYY-MM-DD.parquet
year=YYYY
Reporting year, not publish year. An annual report for fiscal year xxxx lives under year=xxxx/ no matter when it was published. Filter on this key if you're slicing by fiscal period, and on published_at if you're slicing by publication date.
country=XX
ISO 3166-1 alpha-2. Country of primary listing. A dual-listed issuer appears under one country only — use companies/ to resolve secondary listings.
type=…
FRCF filing type. One of annual_report, half_year_report, quarterly_update, ad_hoc, insider_dealing, shareholder_notification, etc. Same enum as the API.
snapshot=…
For dimension tables. companies/ is rewritten in full each day — pick the latest snapshot=… partition for current state, or an earlier one for point-in-time joins.
filing_id=…
Per-filing folders for processed content. Markdown and JSONL live under processed/markdown/year=YYYY/filing_id=flg_…/. The filing ID is identical to the API's id field.
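
For point-in-time joins against companies/, one option is to pick the most recent snapshot partition on or before the date you are analysing. The sketch below assumes you have already listed the `snapshot=YYYY-MM-DD/` partition names (for example with `aws s3 ls`); the function name is hypothetical.

```python
from datetime import date
from typing import Iterable, Optional


def pick_snapshot(available: Iterable[str], as_of: date) -> Optional[str]:
    """Return the latest snapshot=YYYY-MM-DD partition dated on or before as_of."""
    candidates = []
    for name in available:
        key, _, value = name.strip("/").partition("=")
        if key != "snapshot":
            continue  # ignore anything that is not a snapshot partition
        try:
            snapshot_date = date.fromisoformat(value)
        except ValueError:
            continue  # ignore malformed dates
        if snapshot_date <= as_of:
            candidates.append((snapshot_date, name))
    # max() compares by date first, so the newest eligible snapshot wins.
    return max(candidates)[1] if candidates else None


partitions = ["snapshot=2026-05-05/", "snapshot=2026-05-06/"]
print(pick_snapshot(partitions, date(2026, 5, 5)))
```

Passing today's date gives the current state; passing an earlier date gives the company attributes as they were known at that time.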
05 · File formats

Three formats, by access pattern.

We chose Parquet for analytics, JSONL for streaming, Markdown for LLMs. No proprietary formats; everything reads with off-the-shelf tooling.

.parquet

Parquet

Columnar, compressed, much smaller than CSV. Used for everything tabular: filings metadata, line items, company snapshots.

compression: ZSTD
row group: ~128 MB
schema: append-only
part size: ~256 MB

Use it for: aggregations, joins, time-series scans, anything that filters on a few columns.

.jsonl

JSONL

One JSON object per line. Used for filing sections — preserves nested structure that's awkward in Parquet.

compression: gzip
encoding: UTF-8
schema: versioned
line len: ~64 KB

Use it for: streaming into Kafka, line-by-line ML preprocessing, anything that reads top-to-bottom.
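
A minimal sketch of the top-to-bottom read pattern, using only the Python standard library. It assumes the section files arrive as gzip-compressed byte streams, as the spec above suggests; the field names in the sample (`heading`, `text`) are made up for the demo and are not the documented schema.

```python
import gzip
import io
import json
from typing import BinaryIO, Iterator


def iter_sections(stream: BinaryIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzip JSONL stream."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)


# In-memory stand-in for a downloaded sections.jsonl object (hypothetical fields).
sample = [
    {"heading": "Risk report", "text": "Example text."},
    {"heading": "Outlook", "text": "More example text."},
]
raw = gzip.compress("\n".join(json.dumps(s) for s in sample).encode("utf-8"))

sections = list(iter_sections(io.BytesIO(raw)))
print(len(sections), sections[0]["heading"])
```

Because the function is a generator, only one line is held in memory at a time, which suits preprocessing jobs that never need the whole file at once.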

.md

Markdown

Filings rendered to clean Markdown — headings, tables, lists, no boilerplate. The same output the API returns from /filings/{id}/markdown/.

flavor: GFM
encoding: UTF-8
tables: GFM tables
frontmatter: YAML

Use it for: RAG indexing, LLM fine-tuning, full-text search, anything that wants human-readable text.
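
Since the Markdown files carry YAML frontmatter, an indexing pipeline will usually want to separate metadata from body text before chunking. The sketch below does only the split and leaves YAML parsing to the library of your choice; it assumes the conventional `---` delimiters, and the frontmatter fields in the sample are hypothetical.

```python
from typing import Tuple


def split_frontmatter(text: str) -> Tuple[str, str]:
    """Return (frontmatter, body); frontmatter is '' when the document has none."""
    if not text.startswith("---\n"):
        return "", text
    # Look for the closing delimiter on its own line.
    end = text.find("\n---\n", 3)
    if end == -1:
        return "", text
    return text[4:end], text[end + 5:]


# Hypothetical document; the real frontmatter keys may differ.
doc = "---\nfiling_id: flg_EXAMPLE\nyear: 2025\n---\n# Annual report\n"
frontmatter, body = split_frontmatter(doc)
print(frontmatter)
print(body)
```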

06 · Querying

Three engines, same partitions.

Same query expressed in DuckDB, Athena, and Spark. Each is the right tool in different settings — local exploration, serverless SQL, and managed clusters.

DuckDB · top-15 movers, all DACH ad-hoc disclosures · YTD
SELECT
  c.company_name,
  c.country,
  count(*) AS n_disclosures,
  max(f.published_at) AS most_recent
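-- One glob per DACH country partition, so AT and CH are actually read
FROM read_parquet([
  's3://financialreports-bucket/filings/year=2025/country=DE/type=ad_hoc/*.parquet',
  's3://financialreports-bucket/filings/year=2025/country=AT/type=ad_hoc/*.parquet',
  's3://financialreports-bucket/filings/year=2025/country=CH/type=ad_hoc/*.parquet'
]) f
LEFT JOIN read_parquet(
  's3://financialreports-bucket/companies/snapshot=2026-05-06/companies.parquet'
) c USING (company_id)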
GROUP BY 1, 2
ORDER BY n_disclosures DESC
LIMIT 15;
Athena · same query, same partitions
-- One-time table creation (run by your platform team)
CREATE EXTERNAL TABLE fr_filings (
  filing_id STRING,
  company_id STRING,
  published_at TIMESTAMP,
  source STRING
)
-- Athena DDL quotes identifiers with backticks; SELECT statements use double quotes
PARTITIONED BY (year INT, country STRING, `type` STRING)
STORED AS PARQUET
LOCATION 's3://financialreports-bucket/filings/';

MSCK REPAIR TABLE fr_filings;

-- Then, the actual query
SELECT company_id, count(*) AS n
FROM fr_filings
WHERE year = 2025
  AND country IN ('DE', 'AT', 'CH')
  AND "type" = 'ad_hoc'
GROUP BY company_id
ORDER BY n DESC
LIMIT 15;
PySpark · for cluster-scale workloads
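from pyspark.sql import SparkSession

# Reuse the active session or create one; the s3a connector and credentials are assumed to be configured
spark = SparkSession.builder.getOrCreate()

# Reads partition columns automatically from the path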
filings = (spark.read
  .parquet("s3a://financialreports-bucket/filings/")
  .filter("year = 2025 AND country IN ('DE','AT','CH') AND type = 'ad_hoc'"))

companies = (spark.read
  .parquet("s3a://financialreports-bucket/companies/snapshot=2026-05-06/companies.parquet"))

(filings.join(companies, "company_id", "left")
        .groupBy("company_id", "company_name")
        .count()
        .orderBy("count", ascending=False)
        .limit(15)
        .show(truncate=False))
Python · pandas + pyarrow · for notebook work
import pandas as pd

# pyarrow handles partition pruning when you filter on partition keys
filings = pd.read_parquet(
    "s3://financialreports-bucket/filings/",
    filters=[
        ("year", "=", 2025),
        ("country", "in", ["DE", "AT", "CH"]),
        ("type", "=", "ad_hoc"),
    ],
    columns=["filing_id", "company_id", "published_at"],
)

(filings.groupby("company_id")
        .size()
        .nlargest(15))

Always filter on partition keys first. A query without year= or country= in the WHERE clause scans the full corpus. With both, typical scans drop by orders of magnitude.
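
One way to make partition filtering hard to forget is to build the scan paths from the partition keys up front. The helper below is a sketch with a hypothetical name; it expands year, country, and filing type into one glob per partition.

```python
from itertools import product
from typing import Iterable, List

BUCKET = "financialreports-bucket"


def filings_globs(
    years: Iterable[int],
    countries: Iterable[str],
    filing_types: Iterable[str],
) -> List[str]:
    """One Parquet glob per (year, country, type) partition under filings/."""
    return [
        f"s3://{BUCKET}/filings/year={y}/country={c}/type={t}/*.parquet"
        for y, c, t in product(years, countries, filing_types)
    ]


globs = filings_globs([2025], ["DE", "AT", "CH"], ["ad_hoc"])
for path in globs:
    print(path)
```

DuckDB's `read_parquet` accepts a list of paths, so the result can be passed straight into a query. Keep in mind that a glob for a partition that does not exist may raise an error rather than return zero rows.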

07 · Delivery model

Append-only, with manifests for incremental sync.

A typical filing's path through the bucket. Files are written under their final partition once and never moved. A restatement is appended as a new row with a higher version value, never applied as an in-place edit.
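
Because corrections arrive as extra rows rather than edits, consumers that want current state need to keep only the highest version per filing. A minimal pure-Python sketch of that reduction follows; it relies on the `filing_id` and `version` columns named in this document, while the `status` field and the IDs in the sample are hypothetical.

```python
from typing import Dict, Iterable, List


def latest_versions(rows: Iterable[dict]) -> List[dict]:
    """Keep, for each filing_id, the row with the highest version."""
    latest: Dict[str, dict] = {}
    for row in rows:
        current = latest.get(row["filing_id"])
        if current is None or row["version"] > current["version"]:
            latest[row["filing_id"]] = row
    return list(latest.values())


# Hypothetical rows: flg_A was restated once, flg_B never.
rows = [
    {"filing_id": "flg_A", "version": 1, "status": "processed"},
    {"filing_id": "flg_B", "version": 1, "status": "processed"},
    {"filing_id": "flg_A", "version": 2, "status": "processed"},
]
current_state = latest_versions(rows)
print(current_state)
```

The same reduction can be written in SQL with a window function or a `max(version)` subquery; the earlier rows stay available for point-in-time analysis.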

T+0 s

Filing published at source

Issuer publishes to a supported source. Our ingestion picks it up shortly after.

T+ seconds

Metadata row written

A row appears in filings/year=YYYY/country=XX/type=…/. Status starts as received; the filing.received webhook fires.

T+ minutes

Markdown & JSON written

Files appear under processed/markdown/year=YYYY/filing_id=…/ and processed/json/…. The metadata row's status updates to processed.

T+ minutes

Line items extracted (L3)

For annual and half-year reports, KPIs land in line_items/year=YYYY/country=XX/. Quarterly updates and ad-hoc disclosures are skipped — they don't have standardized financials.

T+ hours

Daily manifest written

manifests/incremental/YYYY-MM-DD.parquet lists every key written in the prior 24 h. Tail this for incremental sync.

T+ days

Restatements (rare)

If an issuer corrects a filing, a new row is appended with the same filing_id but version=2. The original row is preserved for point-in-time analysis. Never edited in place.
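
The daily manifest step above can drive an incremental sync: remember the last manifest date you processed, then read every manifest written since. The sketch below only computes which manifest keys to fetch. It assumes one manifest per calendar day, named by date as shown in the layout; exactly which 24 h window each file covers is not specified here, so treat the boundaries as an assumption and make your sync idempotent.

```python
from datetime import date, timedelta
from typing import List


def manifest_keys(last_synced: date, today: date) -> List[str]:
    """Manifest keys for each day after last_synced, up to and including today."""
    days = (today - last_synced).days
    return [
        f"manifests/incremental/{(last_synced + timedelta(days=i)).isoformat()}.parquet"
        for i in range(1, days + 1)
    ]


keys = manifest_keys(date(2026, 5, 4), date(2026, 5, 6))
for key in keys:
    print(key)
```

Each manifest lists the object keys written in its window, so the sync job can fetch exactly those objects instead of re-listing the bucket.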

08 · Security & compliance

IAM-only, region-pinned, audit-logged.

No public access. No long-lived keys outside our infrastructure. Every read is logged and available to you on request.

Access

Cross-account IAM, no static keys.

Your AWS account ID is added to a bucket policy. Inside your account, you create the role and assume it however you like — STS, SSO, instance profile, OIDC. We never see your credentials.

Encryption

SSE-S3 at rest, TLS 1.3 in transit.

All objects are encrypted with AWS-managed keys. Enterprise tier supports SSE-KMS with a CMK in your account if your security policy requires it.

Audit

CloudTrail data events, retained.

Every GetObject and ListBucket is logged. Enterprise tier streams the log directly to your SIEM.


Region pinning. The shared bucket is in eu-central-1 (Frankfurt). Same-region reads via a VPC Gateway Endpoint for S3 are zero-cost on AWS's side. Cross-region reads incur AWS egress charges to your account.

09 · FAQ

Questions every prospect asks.

If yours isn't here, email [email protected]. We'll add it.

Can I get the data in my own bucket instead?

Yes — that's the Enterprise tier. We replicate to a bucket you own using AWS S3 CRR / SRR. Same partitions, same files, your IAM. You control retention and lifecycle.

Is GCS or Azure Blob supported?

Not on the standard plans. On Enterprise we run a sync job into a GCS or ABFS bucket you own. Same source files, ~15 min added latency.

Do you offer a historical backfill download?

The standard bucket already contains the full historical corpus — no separate backfill. New customers can scan everything immediately.

How does this differ from the REST API?

Same data, different access pattern. The API is best for low-latency reads and event-driven apps. S3 is best for full-corpus scans, ML training, and warehouse mirrors. See §02 · API vs S3.

Are restatements visible in S3?

Yes. Every row carries a version column. Use SELECT … WHERE version = (SELECT max(version) ...) for current state, or filter on received_at for point-in-time joins. We never edit a file in place.

What if I'm in a different AWS region than the bucket?

DuckDB and Spark still read fine — AWS bills your account for cross-region egress at their published rate. For sustained workloads, ask us about replication into your region.

Can I get just one country?

The shared bucket includes everything; partition pruning means you only pay AWS for what you scan. If you specifically need a smaller bucket (e.g. for compliance reasons), Enterprise can carve out a country-scoped delivery.

Do you charge per request like the API?

No. Subscription is flat. AWS bills your account for storage class transitions, Athena scans, or egress out of the bucket region — those are between you and AWS.

Ready to point your warehouse at us?

Send your AWS account ID and the country / filing-type partitions you care most about. We'll provision read access in under a business day.