Every filing, in your data lake.
The full FinancialReports corpus, partitioned and delivered to S3. Query G20 markets and Europe with DuckDB, Athena, or Spark — without paying request-by-request.
What S3 bulk delivery is, in three lines.
Same data as the API. Different access pattern. Built for teams running pipelines, training models, or backtesting against the full corpus.
A read-only IAM role into our bucket.
Parquet, JSONL, and Markdown across filings/, companies/, line_items/. Hive-style partitions on year, country, and filing type. Append-only — historical files never change.
Teams that scan more than they fetch.
Quants running cross-sectional analyses, ML teams training on filing text, data warehouses mirroring our corpus. If you're scanning at full-corpus scale, S3 is cheaper than the API.
Flat monthly + your egress.
Subscription includes the data and metadata. AWS bills your account for any cross-region or internet egress. Same-region reads via VPC Gateway Endpoint are zero cost.
Pick by access pattern, not by team preference.
The API is the right answer for almost everything except bulk scans. Use this table to decide — both surfaces share the same identifiers, so you can mix them.
[Table: API vs S3 by access pattern. Surviving cells: line_items/ · Seconds · processed/markdown/ · filing.processed webhooks]
Both surfaces share IDs. A filing's id in /filings/{id}/ is the same string used as the filing_id column in Parquet and the directory name under processed/markdown/. Mix freely.
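Because the identifier is shared, a single filing ID is enough to derive both the API paths and the S3 prefix. The following is a minimal sketch that only assumes the path layouts quoted in this guide. The helper name `filing_locations`, the `FilingLocations` type, and the ID `flg_EXAMPLE` are hypothetical, and the year has to be supplied because the Markdown prefix is partitioned by year.

```python
from dataclasses import dataclass

BUCKET = "financialreports-bucket"  # bucket name as used in this guide


@dataclass(frozen=True)
class FilingLocations:
    api_path: str            # REST path for the filing
    api_markdown_path: str   # REST path for the rendered Markdown
    s3_markdown_prefix: str  # S3 prefix holding document.md and sections.jsonl


def filing_locations(filing_id: str, year: int) -> FilingLocations:
    """Derive API and S3 locations for one filing from its shared ID."""
    return FilingLocations(
        api_path=f"/filings/{filing_id}/",
        api_markdown_path=f"/filings/{filing_id}/markdown/",
        s3_markdown_prefix=(
            f"s3://{BUCKET}/processed/markdown/year={year}/filing_id={filing_id}/"
        ),
    )


# "flg_EXAMPLE" is a made-up ID used only for illustration.
loc = filing_locations("flg_EXAMPLE", 2025)
print(loc.api_path)
print(loc.s3_markdown_prefix)
```

Keeping the mapping in one place means an API response and a Parquet row can be joined on the same string without any translation table.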
From zero to your first query.
Five steps. Assumes you have an AWS account and a recent DuckDB installed locally. For Athena and Spark, see §06 · Querying.
Send us your AWS account ID
Request access or email [email protected]. We'll attach a bucket policy that grants your IAM role read access to financialreports-bucket.
Create an IAM role on your side
A role with s3:GetObject and s3:ListBucket on the bucket. Trust your own account; we don't need to assume it.
List the bucket
Confirm access with aws s3 ls s3://financialreports-bucket/. You should see filings/, companies/, line_items/, processed/, manifests/.
Run your first query
Use DuckDB's httpfs extension to query Parquet directly. No download step. The example below shows the latest annual reports from German companies.
Subscribe to the manifest
For incremental sync, watch manifests/incremental/YYYY-MM-DD.parquet. It lists every key written in the last 24 h, so you don't have to re-list the bucket.
-- Install once
INSTALL httpfs;
LOAD httpfs;

-- Your AWS credentials (or use SSO / role chain)
SET s3_region = 'eu-central-1';
SET s3_access_key_id = '…';
SET s3_secret_access_key = '…';

-- Latest German annual reports, last 12 months
SELECT company_id, company_name, published_at, source
FROM read_parquet(
  's3://financialreports-bucket/filings/year=2025/country=DE/type=annual_report/*.parquet'
)
WHERE published_at >= now() - INTERVAL 12 MONTH
ORDER BY published_at DESC
LIMIT 25;
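For step two, the role's permissions can be written as a small identity policy. The sketch below builds one as a Python dict and prints it as JSON. It assumes only the two actions named above. The helper name `read_only_policy` is hypothetical, and your organization may require extra conditions (for example, restricting access to a VPC endpoint) that are not shown here.

```python
import json

BUCKET = "financialreports-bucket"  # bucket name as used in this guide


def read_only_policy(bucket: str) -> dict:
    """Build an identity policy allowing list and read on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # GetObject is an object-level action, so it targets the keys.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


policy = read_only_policy(BUCKET)
print(json.dumps(policy, indent=2))
```

Cross-account reads need both halves: this identity policy on your role, and the bucket policy on our side that names your account.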
Five top-level prefixes, Hive-style partitions.
Every prefix is partitioned by the columns you'd most plausibly filter on. Partition pruning is the difference between scanning 2 TB and scanning 8 GB.
filings/                          # filing metadata, one row per filing
├── year=2025/
│   ├── country=DE/
│   │   ├── type=annual_report/
│   │   │   └── part-00000.parquet
│   │   └── type=ad_hoc/
│   │       └── part-00000.parquet
│   └── country=FR/ …
├── year=2024/ …
└── year=…/                       # back to corpus start
companies/                        # dimension table, full snapshot daily
├── snapshot=2026-05-06/
│   └── companies.parquet
└── snapshot=2026-05-05/ …
line_items/                       # L3: extracted KPIs (long format)
├── year=2025/
│   └── country=DE/
│       └── part-00000.parquet
└── …
processed/
├── markdown/                     # AI-ready full text
│   └── year=2025/
│       └── filing_id=flg_01HGZ…/
│           ├── document.md
│           └── sections.jsonl
└── json/                         # clean JSON per filing
    └── year=2025/ …
manifests/                        # incremental sync helpers
├── filings/                      # full-history filing manifest
└── incremental/
    └── YYYY-MM-DD.parquet
year: … year=xxxx/. Match this if you're slicing by fiscal period.
country: … companies/ to resolve secondary listings.
type: annual_report, half_year_report, quarterly_update, ad_hoc, insider_dealing, shareholder_notification, etc. Same enum as the API.
snapshot: companies/ is rewritten in full each day; pick the latest snapshot=… partition for current state, or an earlier one for point-in-time joins.
filing_id: processed/markdown/year=YYYY/filing_id=flg_…/. The filing ID is identical to the API's id field.
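Since every partition is encoded as a key=value path segment, the partition values can be recovered from any object key without opening the file. The sketch below shows one way to do that, plus picking the most recent companies/ snapshot from a listing. Both function names are hypothetical; the keys are the ones shown in the layout above.

```python
def parse_partitions(key: str) -> dict[str, str]:
    """Extract Hive-style key=value segments from an S3 object key."""
    parts = {}
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name:
            parts[name] = value
    return parts


def latest_snapshot(keys: list[str]) -> str:
    """Return the newest snapshot date found in a list of companies/ keys.

    ISO dates sort lexicographically, so max() on the strings is enough.
    Raises ValueError if no key carries a snapshot= segment.
    """
    snapshots = {parse_partitions(k).get("snapshot") for k in keys}
    snapshots.discard(None)
    return max(snapshots)


filing_key = "filings/year=2025/country=DE/type=annual_report/part-00000.parquet"
print(parse_partitions(filing_key))

company_keys = [
    "companies/snapshot=2026-05-05/companies.parquet",
    "companies/snapshot=2026-05-06/companies.parquet",
]
print(latest_snapshot(company_keys))
```

Partition values come back as strings; cast year to an integer yourself if you compare it numerically.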
Three formats, by access pattern.
We chose Parquet for analytics, JSONL for streaming, Markdown for LLMs. No proprietary formats; everything reads with off-the-shelf tooling.
Parquet
Columnar, compressed, much smaller than CSV. Used for everything tabular: filings metadata, line items, company snapshots.
Use it for: aggregations, joins, time-series scans, anything that filters on a few columns.
JSONL
One JSON object per line. Used for filing sections — preserves nested structure that's awkward in Parquet.
Use it for: streaming into Kafka, line-by-line ML preprocessing, anything that reads top-to-bottom.
Markdown
Filings rendered to clean Markdown — headings, tables, lists, no boilerplate. The same output the API returns from /filings/{id}/markdown/.
Use it for: RAG indexing, LLM fine-tuning, full-text search, anything that wants human-readable text.
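To illustrate the top-to-bottom access pattern that JSONL is meant for, here is a minimal reader that yields one parsed object per line and never holds the whole file in memory. The function name `iter_sections` is hypothetical, and the `heading` and `text` fields in the sample are invented for the example; they are not the documented schema of sections.jsonl.

```python
import io
import json
from typing import Iterator, TextIO


def iter_sections(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}") from exc


# Stand-in for an opened sections.jsonl file; field names are made up.
sample = io.StringIO(
    '{"heading": "Outlook", "text": "..."}\n'
    '{"heading": "Risks", "text": "..."}\n'
)
headings = [section["heading"] for section in iter_sections(sample)]
print(headings)
```

The same generator works on any text file object, so it can sit in front of a Kafka producer or a tokenizer without change.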
Three engines, same partitions.
The same scan expressed in DuckDB, Athena, Spark, and pandas. Each suits a different setting: local exploration, serverless SQL, managed clusters, and in-process Python.
-- DuckDB
SELECT
  c.company_name,
  c.country,
  count(*) AS n_disclosures,
  max(f.published_at) AS most_recent
FROM read_parquet(
  's3://financialreports-bucket/filings/year=2025/country=DE/type=ad_hoc/*.parquet'
) f
LEFT JOIN read_parquet(
  's3://financialreports-bucket/companies/snapshot=2026-05-06/companies.parquet'
) c USING (company_id)
WHERE c.country IN ('DE', 'AT', 'CH')
GROUP BY 1, 2
ORDER BY n_disclosures DESC
LIMIT 15;
-- Athena
-- One-time table creation (run by your platform team).
-- Athena DDL quotes identifiers with backticks; queries use double quotes.
CREATE EXTERNAL TABLE fr_filings (
  filing_id STRING,
  company_id STRING,
  published_at TIMESTAMP,
  source STRING
)
PARTITIONED BY (year INT, country STRING, `type` STRING)
STORED AS PARQUET
LOCATION 's3://financialreports-bucket/filings/';

-- Register the existing partitions
MSCK REPAIR TABLE fr_filings;

-- Then, the actual query
SELECT company_id, count(*) AS n
FROM fr_filings
WHERE year = 2025
  AND country IN ('DE', 'AT', 'CH')
  AND "type" = 'ad_hoc'
GROUP BY company_id
ORDER BY n DESC
LIMIT 15;
# Spark (PySpark)
from pyspark.sql import SparkSession

# Reuses the active session in a notebook or shell, otherwise creates one
spark = SparkSession.builder.getOrCreate()

# Reads partition columns automatically from the path
filings = (
    spark.read
    .parquet("s3a://financialreports-bucket/filings/")
    .filter("year = 2025 AND country IN ('DE','AT','CH') AND type = 'ad_hoc'")
)
companies = spark.read.parquet(
    "s3a://financialreports-bucket/companies/snapshot=2026-05-06/companies.parquet"
)

(
    filings.join(companies, "company_id", "left")
    .groupBy("company_id", "company_name")
    .count()
    .orderBy("count", ascending=False)
    .limit(15)
    .show(truncate=False)
)
# pandas (reading s3:// paths requires the pyarrow and s3fs packages)
import pandas as pd

# pyarrow handles partition pruning when you filter on partition keys
filings = pd.read_parquet(
    "s3://financialreports-bucket/filings/",
    filters=[
        ("year", "=", 2025),
        ("country", "in", ["DE", "AT", "CH"]),
        ("type", "=", "ad_hoc"),
    ],
    columns=["filing_id", "company_id", "published_at"],
)

# Top 15 companies by number of ad-hoc disclosures
print(filings.groupby("company_id").size().nlargest(15))
Always filter on partition keys first. A query without year= or country= in the WHERE clause scans the full corpus. With both, typical scans drop by orders of magnitude.
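One way to make the partition filter impossible to forget is to generate the partition paths explicitly and hand only those to the engine. The sketch below builds one Parquet glob per (year, country, type) combination under filings/. The function name `filing_partition_globs` is hypothetical; the path template is the one used in the queries above.

```python
from itertools import product

BUCKET = "financialreports-bucket"  # bucket name as used in this guide


def filing_partition_globs(years, countries, filing_types):
    """Return one Parquet glob per (year, country, type) partition."""
    return [
        f"s3://{BUCKET}/filings/year={y}/country={c}/type={t}/*.parquet"
        for y, c, t in product(years, countries, filing_types)
    ]


globs = filing_partition_globs([2025], ["DE", "AT", "CH"], ["ad_hoc"])
for g in globs:
    print(g)
```

DuckDB's read_parquet accepts a list of paths, so the result can be passed straight through. Depending on the engine, a glob that matches no files may raise an error, so it can be worth checking that each partition exists first.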
Append-only, with manifests for incremental sync.
A typical filing's path through the bucket. Files are written under their final partition once and never moved. A restatement is appended as a new row with a higher version value, never as an in-place edit.
Filing published at source
Issuer publishes to a supported source. Our ingestion picks it up shortly after.
Metadata row written
A row appears in filings/year=YYYY/country=XX/type=…/. Status starts as received; the filing.received webhook fires.
Markdown & JSON written
Files appear under processed/markdown/year=YYYY/filing_id=…/ and processed/json/…. The metadata row's status updates to processed.
Line items extracted (L3)
For annual and half-year reports, KPIs land in line_items/year=YYYY/country=XX/. Quarterly updates and ad-hoc disclosures are skipped — they don't have standardized financials.
Daily manifest written
manifests/incremental/YYYY-MM-DD.parquet lists every key written in the prior 24 h. Tail this for incremental sync.
Restatements (rare)
If an issuer corrects a filing, a new row is appended with the same filing_id but version=2. The original row is preserved for point-in-time analysis. Never edited in place.
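For the daily-manifest step, an incremental sync only needs to know which manifest files it has not read yet. The sketch below lists the manifest keys for every day after the last successful sync, assuming one file per calendar day named as shown above. The function name `manifest_keys` is hypothetical, and this guide does not say at what time of day a manifest is written, so a missing key for the current day should be treated as "not published yet" rather than as an error.

```python
from datetime import date, timedelta

MANIFEST_PREFIX = "manifests/incremental"  # prefix as shown in this guide


def manifest_keys(last_synced: date, today: date) -> list[str]:
    """List manifest keys for each day after last_synced, up to and including today."""
    days = (today - last_synced).days
    return [
        f"{MANIFEST_PREFIX}/{(last_synced + timedelta(days=offset)).isoformat()}.parquet"
        for offset in range(1, days + 1)
    ]


keys = manifest_keys(date(2026, 5, 3), date(2026, 5, 6))
for key in keys:
    print(key)
```

Persist the date of the last manifest you processed; because files are append-only, replaying a manifest is safe if a run fails halfway.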
IAM-only, region-pinned, audit-logged.
No public access. No long-lived keys outside our infrastructure. Every read is logged and available to you on request.
Cross-account IAM, no static keys.
Your AWS account ID is added to a bucket policy. Inside your account, you create the role and assume it however you like — STS, SSO, instance profile, OIDC. We never see your credentials.
SSE-S3 at rest, TLS 1.3 in transit.
All objects are encrypted with AWS-managed keys. Enterprise tier supports SSE-KMS with a CMK in your account if your security policy requires it.
CloudTrail data events, retained.
Every GetObject and ListBucket is logged. Enterprise tier streams the log directly to your SIEM.
Region pinning. The shared bucket is in eu-central-1 (Frankfurt). Same-region reads via a VPC Gateway Endpoint for S3 are zero-cost on AWS's side. Cross-region reads incur AWS egress charges to your account.
Questions every prospect asks.
If yours isn't here, email [email protected]. We'll add it.
Can I get the data in my own bucket instead?
Yes — that's the Enterprise tier. We replicate to a bucket you own using AWS S3 CRR / SRR. Same partitions, same files, your IAM. You control retention and lifecycle.
Is GCS or Azure Blob supported?
Not on the standard plans. On Enterprise we run a sync job into a GCS bucket or Azure Blob (ABFS) container you own. Same source files, ~15 min added latency.
Do you offer a historical backfill download?
The standard bucket already contains the full historical corpus — no separate backfill. New customers can scan everything immediately.
How does this differ from the REST API?
Same data, different access pattern. The API is best for low-latency reads and event-driven apps. S3 is best for full-corpus scans, ML training, and warehouse mirrors. See §02 · API vs S3.
Are restatements visible in S3?
Yes. Every row carries a version column. Use SELECT … WHERE version = (SELECT max(version) ...) for current state, or filter on received_at for point-in-time joins. We never edit a file in place.
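The two access patterns in this answer can be sketched in plain Python on rows that have already been loaded. The code assumes that version is an integer that increases with each restatement and that received_at is a timestamp, which matches the description here but is not a documented schema. The function names and sample IDs are hypothetical.

```python
from datetime import datetime


def latest_versions(rows):
    """Keep, for each filing_id, the row with the highest version."""
    latest = {}
    for row in rows:
        current = latest.get(row["filing_id"])
        if current is None or row["version"] > current["version"]:
            latest[row["filing_id"]] = row
    return list(latest.values())


def as_of(rows, cutoff):
    """State as known at cutoff: drop rows received later, then take the latest."""
    return latest_versions([r for r in rows if r["received_at"] <= cutoff])


# Made-up rows: flg_A is restated in June, flg_B is never restated.
rows = [
    {"filing_id": "flg_A", "version": 1, "received_at": datetime(2025, 3, 1)},
    {"filing_id": "flg_A", "version": 2, "received_at": datetime(2025, 6, 1)},
    {"filing_id": "flg_B", "version": 1, "received_at": datetime(2025, 4, 1)},
]
current = {r["filing_id"]: r["version"] for r in latest_versions(rows)}
in_may = {r["filing_id"]: r["version"] for r in as_of(rows, datetime(2025, 5, 1))}
print(current)
print(in_may)
```

A backtest should use the as-of form, otherwise a restatement published later leaks into an earlier decision date.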
What if I'm in a different AWS region than the bucket?
DuckDB and Spark still read fine — AWS bills your account for cross-region egress at their published rate. For sustained workloads, ask us about replication into your region.
Can I get just one country?
The shared bucket includes everything; partition pruning means you only pay AWS for what you scan. If you specifically need a smaller bucket (e.g. for compliance reasons), Enterprise can carve out a country-scoped delivery.
Do you charge per request like the API?
No. Subscription is flat. AWS bills your account for storage class transitions, Athena scans, or egress out of the bucket region — those are between you and AWS.
Ready to point your warehouse at us?
Send your AWS account ID and the country / filing-type partitions you care most about. We'll provision read access in under a business day.