Data warehouses handle structured queries. Data lakes handle everything else: documents, images, logs, backups, ML model weights, that CSV someone exported three years ago and never deleted. The S3 API has become the universal interface for object storage. AWS built it, and now MinIO, SeaweedFS, Ceph, and dozens of others implement it. txn2/mcp-s3 brings this storage layer to AI assistants through MCP, working alongside txn2/mcp-trino.
The problem
AI assistants can write code and answer questions, but they can’t see your data. Your sales reports live in S3. Configuration files sit in MinIO. ML artifacts are scattered across buckets. Without access to this storage layer, AI assistants work blind.
Most MCP servers expose a fixed set of tools and call it done. You configure credentials, run the binary, hope it does what you need. When requirements change, you fork.
I built txn2/mcp-s3 differently. The tools are just the surface. The library underneath is what matters.
txn2/mcp-s3
txn2/mcp-s3 is an MCP server that gives AI assistants access to S3-compatible storage. Install it, configure credentials, and your AI can browse buckets, read objects, and manage files.
Install:
go install github.com/txn2/mcp-s3/cmd/mcp-s3@latest
Configure Claude Code:
claude mcp add --transport stdio mcp-s3 -- \
mcp-s3 --access-key AKIAIOSFODNN7EXAMPLE \
--secret-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
--region us-west-2
Your AI now has access to your S3 buckets.
How it works
graph LR
AI[AI Assistant] -->|MCP stdio| MCP[mcp-s3]
MCP -->|S3 API| S3[Object Storage]
S3 --> AWS[AWS S3]
S3 --> MIN[MinIO]
S3 --> SEA[SeaweedFS]
The AI spawns mcp-s3 as a subprocess. When it needs storage access, it calls tools like s3_list_objects or s3_get_object. txn2/mcp-s3 translates these to S3 API operations and returns results.
The tools:
| Tool | What it does |
|---|---|
| s3_list_buckets | List accessible buckets |
| s3_list_objects | Browse objects with prefix filtering |
| s3_get_object | Download object content |
| s3_get_object_metadata | Get metadata without downloading |
| s3_put_object | Upload (disabled in read-only mode) |
| s3_delete_object | Delete (disabled in read-only mode) |
| s3_copy_object | Copy between buckets |
| s3_presign_url | Generate temporary access URLs |
| s3_list_connections | Show configured connections |
Read-only by default. The AI can explore and retrieve but not modify unless you explicitly enable writes.
Example conversation
Here’s Claude Code with txn2/mcp-s3 analyzing data in a storage bucket:
> What CSV files are in the analytics bucket and what do they contain?
I'll explore the analytics bucket to find CSV files.
⏺ mcp-s3:s3_list_objects
⎿ Running: bucket=analytics, prefix=reports/
Found 3 CSV files in reports/:
- reports/monthly_revenue_2025.csv
- reports/customer_segments.csv
- reports/product_performance.csv
Let me examine each one:
⏺ mcp-s3:s3_get_object
⎿ Fetching: analytics/reports/monthly_revenue_2025.csv
The monthly revenue file contains 12 rows with columns:
month, revenue, costs, profit_margin
Key findings:
- December 2025 had the highest revenue at $4.2M
- Profit margins averaged 23% across the year
- Q4 showed 15% growth over Q3
⏺ mcp-s3:s3_get_object
⎿ Fetching: analytics/reports/customer_segments.csv
Customer segments breakdown:
| Segment | Count | Avg Revenue |
|-------------|--------|-------------|
| Enterprise | 142 | $28,400 |
| Mid-Market | 891 | $4,200 |
| SMB | 3,247 | $840 |
Enterprise customers represent 8% of accounts but 45% of revenue.
> Generate a presigned URL for the revenue report so I can share it
⏺ mcp-s3:s3_presign_url
⎿ Creating: bucket=analytics, key=reports/monthly_revenue_2025.csv,
expires=3600
Here's a temporary URL valid for 1 hour:
https://analytics.s3.us-west-2.amazonaws.com/reports/monthly_revenue_2025.csv?...
Anyone with this link can download the file. The URL expires
automatically, so no permanent access is granted.
The AI navigates storage, reads files, interprets content, generates sharing links. No manual downloads, no local file juggling.
txn2/mcp-s3 as a Go library
Here’s where txn2/mcp-s3 differs from most MCP servers. It’s not just a binary. It’s a Go library you can import and extend.
package main

import (
    "context"
    "log"

    "github.com/mark3labs/mcp-go/server"
    "github.com/txn2/mcp-s3/pkg/client"
    "github.com/txn2/mcp-s3/pkg/tools"
)

func main() {
    ctx := context.Background()

    // Create the S3 client
    cfg := client.Config{
        AccessKey: "AKIAIOSFODNN7EXAMPLE",
        SecretKey: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        Region:    "us-west-2",
    }
    s3Client, err := client.New(ctx, &cfg)
    if err != nil {
        log.Fatal(err)
    }

    // Create toolkit with built-in interceptors
    toolkit := tools.NewToolkit(s3Client,
        tools.WithReadOnly(true),
        tools.WithMaxGetSize(10*1024*1024), // 10MB limit
    )

    // Create your MCP server
    s := server.NewMCPServer("my-storage-server", "1.0.0")

    // Add S3 tools with your middleware
    toolkit.RegisterToolsWithMiddleware(s,
        withAuditLog,
        withTenantIsolation,
        withContentFilter,
    )

    // Add your own custom tools
    s.AddTool(myCustomTool, myCustomHandler)

    // Serve over stdio
    if err := server.ServeStdio(s); err != nil {
        log.Fatal(err)
    }
}
graph TD
CS[Your Custom MCP Server] --> S3[mcp-s3 library]
CS --> YH[Your Handlers]
S3 --> RO[ReadOnly Interceptor]
S3 --> SL[SizeLimit Interceptor]
S3 --> LG[Logging Middleware]
What ships in the box
The library includes middleware you’ll probably need:
The ReadOnlyInterceptor blocks all write operations:
readonly := extensions.NewReadOnlyInterceptor(true)
The SizeLimitInterceptor caps object sizes:
sizelimit := extensions.NewSizeLimitInterceptor(
    10*1024*1024,  // 10MB get limit
    100*1024*1024, // 100MB put limit
)
The LoggingMiddleware records operations:
logging := extensions.NewLoggingMiddleware(logger)
Writing your own middleware
You can add whatever policy layers you need.
Tenant isolation that restricts access to specific bucket prefixes:
func withTenantIsolation(next server.ToolHandler) server.ToolHandler {
    return func(ctx context.Context, req server.ToolRequest) (*server.ToolResponse, error) {
        tenant := getTenantFromContext(ctx)

        // Not every tool takes a bucket parameter (s3_list_buckets, for example);
        // only enforce the boundary when one is present.
        bucket, ok := req.Params["bucket"].(string)
        if ok && !strings.HasPrefix(bucket, tenant+"-") {
            return nil, errors.New("access denied: bucket not in tenant scope")
        }
        return next(ctx, req)
    }
}
Content filtering that blocks sensitive file types:
func withContentFilter(next server.ToolHandler) server.ToolHandler {
    return func(ctx context.Context, req server.ToolRequest) (*server.ToolResponse, error) {
        // Tools without a key parameter pass through unchanged.
        key, ok := req.Params["key"].(string)
        if ok && (strings.HasSuffix(key, ".env") || strings.Contains(key, "credentials")) {
            return nil, errors.New("access to sensitive files blocked")
        }
        return next(ctx, req)
    }
}
Audit logging for compliance:
func withAuditLog(next server.ToolHandler) server.ToolHandler {
    return func(ctx context.Context, req server.ToolRequest) (*server.ToolResponse, error) {
        auditLog.Record(AuditEntry{
            User:      getUserFromContext(ctx),
            Tool:      req.Name,
            Bucket:    req.Params["bucket"],
            Key:       req.Params["key"],
            Timestamp: time.Now(),
        })
        return next(ctx, req)
    }
}
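Middleware composes by wrapping: each layer takes the next handler and returns a new one. If you want to build a chain by hand, a small helper makes the stacking explicit. This is just a sketch reusing the handler shapes from the examples above; it is not part of the txn2/mcp-s3 API:

// chain wraps a base handler with middleware. It applies the list in reverse
// so the first middleware ends up outermost and runs first on every request.
func chain(base server.ToolHandler, mws ...func(server.ToolHandler) server.ToolHandler) server.ToolHandler {
    h := base
    for i := len(mws) - 1; i >= 0; i-- {
        h = mws[i](h)
    }
    return h
}

// chain(base, withAuditLog, withTenantIsolation, withContentFilter)
// is equivalent to withAuditLog(withTenantIsolation(withContentFilter(base))).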
You control the server. No forking, no patching, no waiting for upstream.
Tools are just the surface
Most MCP discussions focus on tools: what functions the server exposes, what parameters they accept. This misses what actually matters.
Tools are the API surface. The interesting part is what happens between the AI’s request and the underlying system.
graph TB
AI[AI Assistant] --> T[Tool Call]
T --> AUTH[Authentication Layer]
AUTH --> AUDIT[Audit Logging]
AUDIT --> FILTER[Content Filtering]
FILTER --> TRANSFORM[Response Transform]
TRANSFORM --> S3[S3 API]
Think about what a composable MCP library lets you do.
You can build semantic layers that transform raw S3 keys into business concepts. reports/2025/q4/revenue.csv becomes “Q4 2025 Revenue Report” with metadata about owners, freshness, and data quality.
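As a toy illustration of that mapping, here is a key-to-title helper. It is a sketch that only assumes the reports/<year>/<quarter>/<name>.csv layout from the example; the function name and layout rule are illustrative, not part of the library:

// friendlyTitle turns a raw object key like "reports/2025/q4/revenue.csv"
// into "Q4 2025 Revenue Report". Keys that don't match the assumed layout
// are returned unchanged.
func friendlyTitle(key string) string {
    parts := strings.Split(strings.TrimSuffix(key, ".csv"), "/")
    if len(parts) != 4 || parts[0] != "reports" || parts[3] == "" {
        return key
    }
    year, quarter, name := parts[1], strings.ToUpper(parts[2]), parts[3]
    name = strings.ToUpper(name[:1]) + name[1:] // "revenue" -> "Revenue"
    return fmt.Sprintf("%s %s %s Report", quarter, year, name)
}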
You can enforce access control at whatever granularity you need: tenant boundaries, time-based access windows, row-level security. The AI sees only what the current user should see.
You can transform responses before they reach the AI. Redact PII from documents. Summarize large files. Convert formats. The AI gets processed, policy-compliant data.
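A redaction pass can be as simple as a string transform applied to object content before it crosses the MCP boundary. This sketch masks email addresses with a single regex; the pattern and function name are illustrative, and real PII handling needs far more than one expression:

// redactEmails masks anything that looks like an email address before the
// object content is returned to the AI.
var emailPattern = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)

func redactEmails(content string) string {
    return emailPattern.ReplaceAllString(content, "[REDACTED]")
}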
You can correlate across systems. “Show me all reports created by the finance team” requires understanding both S3 and your identity system. A composable library makes that possible.
None of this works with a black-box binary. You need a library with the right extension points.
The TXN2 MCP ecosystem
I built txn2/mcp-s3 alongside txn2/mcp-trino. Together they cover two patterns of data access that enterprises actually need:
graph LR
AI[AI Assistant] --> TR[mcp-trino]
AI --> S3[mcp-s3]
TR --> DW[Data Warehouse]
S3 --> DL[Data Lake]
DW --> PG[PostgreSQL]
DW --> MY[MySQL]
DW --> IC[Iceberg]
DL --> AWS[AWS S3]
DL --> MIN[MinIO]
DL --> SEA[SeaweedFS]
txn2/mcp-trino handles structured data: SQL queries, joins across databases, aggregations. When the AI needs “What were our top products last quarter?”, txn2/mcp-trino queries the data warehouse.
txn2/mcp-s3 handles unstructured data: documents, images, logs, exports. When the AI needs “Find the architecture diagram for the payments service”, txn2/mcp-s3 searches object storage.
Both are designed as composable libraries. Both support the same middleware patterns. You can build one custom MCP server that combines Trino queries with S3 storage, wrapped in your authentication and audit logging:
// Combine txn2/mcp-trino and txn2/mcp-s3 in one server
trinoTools := trino.NewTools(trinoConfig)
s3Toolkit := tools.NewToolkit(s3Client)

s := server.NewMCPServer("enterprise-data-server", "1.0.0")
trinoTools.RegisterToolsWithMiddleware(s, withAuth, withAudit)
s3Toolkit.RegisterToolsWithMiddleware(s, withAuth, withAudit)

if err := server.ServeStdio(s); err != nil {
    log.Fatal(err)
}
One server, two data access patterns, unified policy layer.
Multi-provider support
S3 isn’t just AWS anymore. The API has become an industry standard:
- AWS S3 (the original)
- MinIO (popular for self-hosted)
- SeaweedFS (distributed file system with S3 gateway)
- Ceph (enterprise storage with S3-compatible RADOS Gateway)
- LocalStack (S3 emulation for local dev)
- Backblaze B2 (cheap cloud storage with S3 compatibility)
txn2/mcp-s3 works with all of them. Configure multiple connections and your AI can access objects across providers:
mcp-s3 --connections config.yaml
connections:
  - name: production
    endpoint: s3.amazonaws.com
    access_key: ${AWS_ACCESS_KEY}
    secret_key: ${AWS_SECRET_KEY}
    region: us-west-2
  - name: development
    endpoint: localhost:9000
    access_key: minioadmin
    secret_key: minioadmin
    force_path_style: true
  - name: archive
    endpoint: s3.us-west-002.backblazeb2.com
    access_key: ${B2_ACCESS_KEY}
    secret_key: ${B2_SECRET_KEY}
The AI can work across all configured connections, and the connection names give it context: “production” holds live data, “archive” holds historical backups.
Safe defaults
txn2/mcp-s3 ships read-only with size limits. Sensible defaults matter when you’re giving an AI access to storage.
Writes are disabled unless you explicitly enable them. Size limits prevent downloading multi-gigabyte files. Prefix restrictions limit access to specific paths. Credentials never get exposed to the AI.
Enable writes when you need them:
mcp-s3 --read-only=false --max-put-size=100MB
Restrict to specific prefixes:
mcp-s3 --allowed-prefixes="reports/,exports/"
With this flag, the AI can only access objects under reports/ or exports/, no matter what it asks for.
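If you build your own server from the library, the same check is easy to express as a guard in your middleware. A minimal sketch, not the binary's actual implementation:

// keyAllowed reports whether an object key falls under one of the
// configured prefixes, e.g. []string{"reports/", "exports/"}.
func keyAllowed(key string, allowedPrefixes []string) bool {
    for _, p := range allowedPrefixes {
        if strings.HasPrefix(key, p) {
            return true
        }
    }
    return false
}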
Try it with LocalStack
You can test txn2/mcp-s3 without cloud credentials using LocalStack:
# Start LocalStack
docker run -d -p 4566:4566 localstack/localstack
# Configure AWS CLI for LocalStack
export AWS_ACCESS_KEY_ID=test
export AWS_SECRET_ACCESS_KEY=test
export AWS_DEFAULT_REGION=us-east-1
# Create a test bucket
aws --endpoint-url=http://localhost:4566 s3 mb s3://test-bucket
# Upload test data
echo "Hello from S3" | aws --endpoint-url=http://localhost:4566 \
s3 cp - s3://test-bucket/hello.txt
# Configure mcp-s3
claude mcp add --transport stdio mcp-s3 -- \
mcp-s3 --endpoint http://localhost:4566 \
--access-key test --secret-key test \
--force-path-style
Now ask your AI to explore the test bucket.
Why composable wins
The MCP ecosystem is young. Everyone is shipping binaries that expose tools. But tools are commodity work. Any project can implement list_buckets and get_object.
What matters is the library architecture. Can you intercept requests? Transform responses? Compose multiple data sources? Add enterprise policy without forking?
Projects that can do this become infrastructure. Projects that ship fixed binaries become dependencies you replace when requirements change.
txn2/mcp-s3 and txn2/mcp-trino bet on composability. Use them as binaries for quick starts. Import them as libraries for production. Same codebase, both use cases.
Resources
- txn2/mcp-s3 on GitHub
- txn2/mcp-s3 documentation
- txn2/mcp-trino: AI data warehouse access
- Model Context Protocol specification
- mcp-go: Go SDK for MCP servers
- AI-assisted Kubernetes development with kubefwd
This blog post, titled: "AI Data Lake Access with MCP and S3: Building composable MCP servers for object storage" by Craig Johnston, is licensed under a Creative Commons Attribution 4.0 International License.