Projects — Utkarsh Batham

01

Distributed Systems · Serverless · SAGA Pattern

CloudFlow — Distributed SAGA Order Processing

Python · AWS Step Functions · Lambda · DynamoDB · SQS · CDK · LocalStack · X-Ray

1,100+ req/min · <120 ms P99 latency · 30+ tests · SAGA pattern · Auto compensation

// architecture flow

API GW → Order λ → SQS → Reserve λ · Payment λ → Confirm λ → DynamoDB → X-Ray

On failure → compensating transactions auto-rollback

load_test.py — LocalStack

# 50 concurrent threads

$ python load_test.py --threads 50 --duration 60

Requests sent: 68,412

Success rate: 99.97%

Req/min peak: 1,147

P50 latency: 34ms

P99 latency: 118ms

Compensation: 100% triggered
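How load_test.py derives its P50 and P99 figures is not shown here. One common approach is the nearest-rank percentile over the recorded per-request latencies. The sketch below assumes that method; the function names are hypothetical and not taken from the project.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number inputs stay exact.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Summarize a list of per-request latencies in milliseconds."""
    return {
        "requests": len(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
    }
```

Nearest-rank always returns a latency that was actually observed, which keeps a P99 figure easy to trace back to a concrete request.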

Why SAGA choreography

Chose choreography over orchestration: each Lambda reacts to SQS events independently, with no central coordinator. That improves fault isolation and service autonomy. When a step fails, each affected service runs its own compensating transaction, so rollback does not depend on a coordinator.
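The choreography idea can be sketched with an in-memory event bus standing in for the SQS queues. Everything here is an assumption for illustration (the event names, the EventBus class, the handler functions); it is not the project's code. Each handler only knows which event it reacts to and which event it emits, and the compensation is just another subscriber.

```python
from collections import defaultdict, deque


class EventBus:
    """In-memory stand-in for the queues between services (hypothetical)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.log = []

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, order):
        self.queue.append((event_type, order))

    def run(self):
        # Deliver events in FIFO order until no handler emits anything new.
        while self.queue:
            event_type, order = self.queue.popleft()
            self.log.append(event_type)
            for handler in self.handlers[event_type]:
                handler(order)


def build_saga(bus, payment_ok=True):
    """Wire up reserve, charge, confirm, and one compensating handler."""
    state = {"reserved": set(), "status": {}}

    def reserve(order):
        state["reserved"].add(order["id"])
        bus.publish("InventoryReserved", order)

    def charge(order):
        bus.publish("PaymentSucceeded" if payment_ok else "PaymentFailed", order)

    def confirm(order):
        state["status"][order["id"]] = "CONFIRMED"

    def release_inventory(order):
        # Compensating transaction: undo the reservation made earlier.
        state["reserved"].discard(order["id"])
        state["status"][order["id"]] = "CANCELLED"

    bus.subscribe("OrderPlaced", reserve)
    bus.subscribe("InventoryReserved", charge)
    bus.subscribe("PaymentSucceeded", confirm)
    bus.subscribe("PaymentFailed", release_inventory)
    return state


bus = EventBus()
state = build_saga(bus, payment_ok=False)
bus.publish("OrderPlaced", {"id": "o-1"})
bus.run()
```

No component in this sketch sees the whole workflow; the rollback happens because the inventory handler subscribes to the failure event.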

Hardest engineering problems

Idempotency under duplicate SQS delivery. Partial failures mid-SAGA. Circuit breaker calibration to prevent cascade failures without over-tripping on transient errors. Getting X-Ray to trace across Lambda boundaries without noise.
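One way to handle duplicate SQS delivery is to claim each message ID before doing any work and to skip messages that were already claimed. The sketch below uses an in-memory dictionary as a stand-in for a persistent store. In a real deployment the claim would more likely be a DynamoDB conditional write (for example with an attribute_not_exists condition), which is an assumption here and not something the project description confirms. The class and function names are hypothetical.

```python
import time


class IdempotencyStore:
    """In-memory stand-in for a table keyed by message ID (assumption)."""

    def __init__(self):
        self._expires_at = {}

    def claim(self, message_id, ttl_seconds=3600, now=None):
        """Return True only for the first claim of message_id within the TTL."""
        now = time.time() if now is None else now
        expires = self._expires_at.get(message_id)
        if expires is not None and expires > now:
            return False  # duplicate delivery inside the window
        self._expires_at[message_id] = now + ttl_seconds
        return True


def handle_messages(messages, store, process):
    """Run process once per distinct messageId, skipping duplicates."""
    processed = []
    for msg in messages:
        if store.claim(msg["messageId"]):
            process(msg)
            processed.append(msg["messageId"])
    return processed
```

Claiming before processing means a crash mid-handler leaves the message marked as done, so a fuller version would record an in-progress state or release the claim on failure.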

Why a recruiter should care: SAGA choreography is how Uber, Netflix, and Amazon handle long-lived distributed transactions in production. Building it without a framework — understanding every moving part — demonstrates the systems thinking that separates senior engineers from developers who just call managed services.

→ GitHub

02

Analytics · Serverless · Lambda Architecture

CloudPulse — Real-Time Analytics Platform

Python · Kinesis · S3 · Athena · DynamoDB · Terraform · React · Cognito · Glue

Dual-path Lambda architecture · AWS Free Tier · 100% Terraform IaC · JWT Cognito auth

// lambda architecture

Speed Layer: Kinesis → Lambda → DynamoDB TTL

Batch Layer: SQS → S3 / Glue → Athena SQL

// cost + scale profile

Monthly cost: $0 (Free Tier)

IaC coverage: 100%

Manual setup: 0 clicks

Data freshness: <30s

Architecture decision

Lambda Architecture separates real-time and batch cleanly. Speed layer handles low-latency dashboards via Kinesis → Lambda → DynamoDB. Batch layer stores full event history in S3 for Athena SQL queries — both paths run independently.
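A rough sketch of how one event could feed both paths: the speed path keeps a small running aggregate, and the batch path writes the raw event under a date-partitioned key. The event shape, the key layout (Hive-style dt= partitions), and the function names are assumptions for illustration, not the project's actual schema.

```python
from collections import Counter
from datetime import datetime, timezone


def speed_layer_update(counters, event):
    """Speed path: bump a per-minute counter for low-latency dashboards."""
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    counters[(event["type"], ts.strftime("%Y-%m-%dT%H:%M"))] += 1


def batch_layer_key(event):
    """Batch path: date-partitioned object key so SQL queries can prune by day."""
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return f"events/dt={ts:%Y-%m-%d}/{event['id']}.json"


counters = Counter()
event = {"id": "e1", "type": "click", "ts": 0}
speed_layer_update(counters, event)
key = batch_layer_key(event)
```

The two functions share no state, which mirrors the point that either path can fail or lag without affecting the other.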

Interesting detail

A 24-hour DynamoDB TTL expires stale data without a cleanup Lambda or cron job. The Glue Data Catalog makes S3 data queryable via standard SQL in Athena. Cognito JWT auth keeps the API stateless and horizontally scalable.
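DynamoDB TTL works off a numeric attribute holding a Unix epoch timestamp in seconds. A minimal sketch of stamping items with a 24-hour expiry follows; the attribute name expires_at and the helper names are hypothetical. Because TTL deletion happens in the background and is not immediate, the sketch also includes a read-side check for items that have expired but may not have been removed yet.

```python
import time

TTL_SECONDS = 24 * 60 * 60  # 24 hours


def with_ttl(item, now=None, ttl_seconds=TTL_SECONDS):
    """Return a copy of item with an epoch-seconds expiry attribute added."""
    now = int(time.time() if now is None else now)
    return {**item, "expires_at": now + ttl_seconds}


def is_expired(item, now=None):
    """True if the item's expiry has passed, even if it has not been deleted yet."""
    now = int(time.time() if now is None else now)
    return item["expires_at"] <= now
```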

Why it matters: Lambda Architecture is used by Netflix, LinkedIn, and Twitter for analytics at scale. Running this within AWS Free Tier shows infrastructure cost awareness — a trait most junior engineers lack.

→ GitHub

03

Security · CIS Benchmark · Automation

CSPM — Cloud Security Posture Management

Python · Lambda · EventBridge · SNS · S3 · CloudWatch · Terraform · GitHub Actions

CIS Benchmark v1.5 · <5 s alert latency · Hourly EventBridge scan · Auto remediation · 4+ AWS services scanned

cspm_scan.py — scan output

$ python cspm_scan.py --env prod

[IAM] Scanning policies... done

[CRITICAL] Root account MFA disabled

[HIGH] 3 overly permissive policies found

[S3] Checking bucket ACLs...

[HIGH] 2 public buckets detected

[SNS] Alert dispatched in 3.2s

[AUTO] Remediating safe violations...

[DONE] Report saved → s3://audit/2026-05

// scan pipeline

EventBridge → Scanner λ · IAM checks · S3 checks · EC2 checks · SNS alert → S3 audit log

Auto-remediate safe violations · block critical

What it detects

CIS Benchmark v1.5 controls across IAM (root MFA, key rotation, overpermissive policies), S3 (public access, versioning, logging), EC2 (security groups, encryption), and CloudTrail (logging enabled, multi-region).
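As an illustration of what one such check might look like, the sketch below flags buckets whose ACL grants access to the predefined AllUsers or AuthenticatedUsers groups. It operates on plain dictionaries shaped like an S3 bucket ACL response, so it runs without AWS access. The function names and the finding format are hypothetical, and ACLs are only one route to public exposure; bucket policies and Block Public Access settings would need their own checks.

```python
# Group URIs that S3 uses for "everyone" and "any authenticated AWS user".
PUBLIC_GROUPS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}


def public_grants(acl):
    """Return the permissions an ACL grants to public groups."""
    return [
        grant["Permission"]
        for grant in acl.get("Grants", [])
        if grant.get("Grantee", {}).get("Type") == "Group"
        and grant["Grantee"].get("URI") in PUBLIC_GROUPS
    ]


def scan_buckets(acls_by_bucket):
    """Produce one HIGH finding per bucket with any public grant."""
    findings = []
    for name, acl in sorted(acls_by_bucket.items()):
        permissions = public_grants(acl)
        if permissions:
            findings.append(
                {"severity": "HIGH", "bucket": name, "permissions": permissions}
            )
    return findings
```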

Why this stack

EventBridge for scheduling over CloudWatch Events — better audit trail. SNS over SES for alerts — topic-based, scales to multiple subscribers. All infrastructure in Terraform — CI auto-deploys scanner updates, zero manual provisioning.

Why it matters: CSPM is a $9B market. Wiz raised $1B doing this at enterprise scale. Building it from scratch shows security engineering depth, not just cloud operations. Most engineers know how to deploy — fewer know how to secure what they deploy.

→ GitHub

04

Research · Formal Methods · arXiv

AgriFuture India — Formal Verification Research

TLA+ · BFS Model Checker (Python) · Formal Verification · Distributed Marketplace

Novel research finding · BFS state exhaustion · TLA+ formal spec · arXiv published

tlc_agrifuture_final.py

# BFS model checker — TAR preservation

$ python tlc_agrifuture_final.py

States explored: 14,872

Violations found: 1 CRITICAL

TAR violation in partial revocation path:

state[42] → race condition window open

Finding: atomic revocation required

// verification methodology

TLA+ Spec → BFS Checker

State space → 14,872 states

Finding → arXiv preprint

Novel: TAR preservation requires atomic revocation

The finding

TAR (Trust and Access Revocation) preservation in distributed marketplaces requires atomic revocation. Partial revocation (revoking access step by step) creates a race condition window in which unauthorized access is possible. Shown by exhaustive BFS exploration of the protocol state space.

Why it's real research

Not an implementation. A correctness theorem with a formal proof. Built a custom BFS model checker in Python to exhaustively explore the protocol state space. Found the violation at state 42 of 14,872. Published on arXiv with TLA+-style spec and full methodology.
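The general technique can be illustrated with a toy model that is far smaller than the one described above and is not the author's spec. A state is a pair (revoked, grants). The invariant used here, an assumed simplification of TAR preservation, says that once a user is marked revoked, no grants may remain. The checker explores every reachable state breadth-first and returns the first state that breaks the invariant.

```python
from collections import deque


def bfs_check(initial, next_states, invariant):
    """Explore all reachable states; return (states_seen, first_violation_or_None)."""
    seen = {initial}
    frontier = deque([initial])
    while frontier:
        state = frontier.popleft()
        if not invariant(state):
            return len(seen), state
        for nxt in next_states(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen), None


def partial_revocation(state):
    """Toy protocol: mark the user revoked, then drop grants one step at a time."""
    revoked, grants = state
    if not revoked:
        yield (True, grants - {"listing"})
    else:
        for grant in grants:
            yield (True, grants - {grant})


def atomic_revocation(state):
    """Toy protocol: mark revoked and drop every grant in a single step."""
    revoked, _grants = state
    if not revoked:
        yield (True, frozenset())


def tar_holds(state):
    """Assumed invariant: a revoked user holds no remaining grants."""
    revoked, grants = state
    return not (revoked and grants)


initial = (False, frozenset({"listing", "payment"}))
_, violation = bfs_check(initial, partial_revocation, tar_holds)
_, no_violation = bfs_check(initial, atomic_revocation, tar_holds)
```

In this toy model the stepwise protocol reaches a state where the user is revoked but still holds the payment grant, while the atomic protocol has no such intermediate state.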

Why it matters: TLA+ is used by Amazon engineers to verify DynamoDB and S3 protocols. Microsoft uses it for distributed systems in Azure. Most CS graduates have never written a formal spec. This is a differentiator that can't be faked.

→ GitHub → arXiv preprint

05

RAG / AI · LLMs · Vector Search

Wikipedia Smart Search — RAG QA System

Python · RAG · Vector Search · Embeddings · LLMs · Semantic Retrieval

RAG architecture · Vector semantic search · Zero-hallucination design · No framework wrappers

rag_pipeline.py

# Query: "SAGA pattern in distributed systems"

$ python query.py --q "SAGA pattern"

Embedding query... done (34ms)

Vector search top-5... done (12ms)

Chunks retrieved: 5

Context window: 3,847 tokens

LLM inference... done (1.2s)

Answer: grounded response with citations

// RAG pipeline

Wikipedia → Chunk + Embed

Vector Index ← Query Embed

Top-k chunks → LLM → Answer

Grounded response — no hallucination

Built without wrappers

No LangChain. No LlamaIndex. Raw embedding API calls, manual vector indexing, custom retrieval pipeline, hand-written prompt templates. Understanding the internals means debugging when things break — which they do, in production.
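A stripped-down version of the retrieval and prompt-building steps might look like the sketch below. It assumes embeddings are already available as plain lists of floats and uses a brute-force cosine search over an in-memory list; the index layout, the function names, and the prompt wording are all hypothetical and not taken from the project.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=5):
    """Brute-force search: rank every chunk by similarity to the query."""
    ranked = sorted(
        index, key=lambda entry: cosine(query_vec, entry["vector"]), reverse=True
    )
    return ranked[:k]


def build_prompt(question, chunks):
    """Number the retrieved chunks so the model can cite them as [1], [2], ..."""
    context = "\n".join(f"[{i}] {c['text']}" for i, c in enumerate(chunks, 1))
    return (
        "Answer using only the numbered context and cite sources like [1]. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Brute-force search is linear in the number of chunks, which is fine for a small corpus; a larger index would usually need an approximate nearest-neighbor structure.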

Production relevance

Many cloud teams are building RAG pipelines over internal runbooks, documentation, and knowledge bases. Amazon Q Business, Google Vertex AI Search, and Azure AI Search all use this pattern. Understanding it at the implementation level is increasingly required.

→ GitHub

Built. Not described.
