// systems shipped
Built.
Not described.
Production systems with real benchmarks, real architecture decisions, and real things that went wrong along the way.
Why SAGA choreography
Chose choreography over orchestration: each Lambda reacts to SQS events independently. No central brain. Better fault isolation and service autonomy. Each service runs its own compensating transaction on failure, so rollback needs no central coordinator.
Hardest engineering problems
Idempotency under duplicate SQS delivery. Partial failures mid-SAGA. Circuit breaker calibration to prevent cascade failures without over-tripping on transient errors. Getting X-Ray to trace across Lambda boundaries without noise.
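A minimal sketch of the idempotency idea, not the production handler. It assumes each message carries a unique `message_id` (a hypothetical field name) and uses an in-memory set as a stand-in for a durable deduplication store. In a real deployment the check-and-record step would need to be atomic across concurrent Lambda invocations.

```python
from typing import Callable, Dict, Set


class IdempotentConsumer:
    """Processes each message ID at most once, even if delivered repeatedly."""

    def __init__(self, handler: Callable[[Dict], None]) -> None:
        self._handler = handler
        # Stand-in for a durable store keyed by message ID (assumption:
        # production code would use an atomic conditional write instead).
        self._seen: Set[str] = set()

    def handle(self, message: Dict) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        message_id = message["message_id"]  # hypothetical field name
        if message_id in self._seen:
            return False
        self._handler(message)
        # Record the ID only after the handler succeeds, so a failed
        # attempt can still be retried when the message is redelivered.
        self._seen.add(message_id)
        return True


charged = []
consumer = IdempotentConsumer(lambda m: charged.append(m["order_id"]))
first = consumer.handle({"message_id": "m-1", "order_id": "o-42"})
second = consumer.handle({"message_id": "m-1", "order_id": "o-42"})  # duplicate delivery
```

The second call is skipped, so the side effect in `charged` happens once.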
Architecture decision
Lambda Architecture separates real-time and batch cleanly. Speed layer handles low-latency dashboards via Kinesis → Lambda → DynamoDB. Batch layer stores full event history in S3 for Athena SQL queries — both paths run independently.
Interesting detail
DynamoDB 24-hour TTL auto-cleans stale data without Lambda or cron. Glue Data Catalog makes S3 data queryable via standard SQL in Athena. Cognito JWT auth means API stays stateless and horizontally scalable.
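A small sketch of how a 24-hour expiry could be stamped on each item before it is written. DynamoDB TTL reads a Number attribute holding a Unix epoch timestamp in seconds; the attribute name `expires_at` and the helper `with_ttl` are assumptions for illustration. Deletion runs as a background process, so an expired item can remain readable for a while after its timestamp passes.

```python
import time
from typing import Dict, Optional

TTL_SECONDS = 24 * 60 * 60  # 24-hour retention, as described above


def with_ttl(item: Dict, now: Optional[float] = None) -> Dict:
    """Return a copy of item with an `expires_at` epoch-seconds attribute."""
    issued = time.time() if now is None else now
    # `expires_at` is a hypothetical attribute name; it must match the
    # attribute configured as the table's TTL attribute.
    return {**item, "expires_at": int(issued) + TTL_SECONDS}


# Fixed timestamp so the example is deterministic.
record = with_ttl({"metric": "page_views", "value": 17}, now=1_700_000_000)
```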
What it detects
CIS Benchmark v1.5 controls across IAM (root MFA, key rotation, overpermissive policies), S3 (public access, versioning, logging), EC2 (security groups, encryption), and CloudTrail (logging enabled, multi-region).
Why this stack
EventBridge for scheduling over CloudWatch Events — better audit trail. SNS over SES for alerts — topic-based, scales to multiple subscribers. All infrastructure in Terraform — CI auto-deploys scanner updates, zero manual provisioning.
The finding
TAR (Trust and Access Revocation) preservation in distributed marketplaces requires atomic revocation. Partial revocation (revoking access step by step) opens a race-condition window in which unauthorized access is possible. Shown by exhaustive BFS exploration of the state space.
Why it's real research
Not an implementation. A correctness theorem with a formal proof. Built a custom BFS model checker in Python to exhaustively explore the protocol state space. Found the violation at state 42 of 14,872. Published on arXiv with TLA+-style spec and full methodology.
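The general technique can be sketched as follows. This is a toy model, not the published protocol or the original checker: the state is a hypothetical tuple `(revocation_started, token_active, session_active)`, and the invariant says no credential may remain active once revocation has started. BFS returns the shortest path to a violating state, or `None` if every reachable state satisfies the invariant.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional, Tuple

State = Tuple[bool, bool, bool]  # (revocation_started, token_active, session_active)


def bfs_check(
    initial: State,
    successors: Callable[[State], Iterable[State]],
    invariant: Callable[[State], bool],
) -> Optional[List[State]]:
    """Return the shortest path to a state violating the invariant, or None."""
    parent: Dict[State, Optional[State]] = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            # Walk parent links back to the initial state to build the trace.
            path: List[State] = []
            cursor: Optional[State] = state
            while cursor is not None:
                path.append(cursor)
                cursor = parent[cursor]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


def stepwise(state: State) -> Iterable[State]:
    """Toy partial revocation: revoke the token first, the session later."""
    _, token, session = state
    if token:
        yield (True, False, session)
    elif session:
        yield (True, False, False)


def atomic(state: State) -> Iterable[State]:
    """Toy atomic revocation: both credentials are revoked in one transition."""
    _, token, session = state
    if token or session:
        yield (True, False, False)


def no_access_after_revocation(state: State) -> bool:
    started, token, session = state
    return not (started and (token or session))


start: State = (False, True, True)
stepwise_trace = bfs_check(start, stepwise, no_access_after_revocation)
atomic_trace = bfs_check(start, atomic, no_access_after_revocation)
```

In this toy model the stepwise variant yields a two-state counterexample ending with the session still active, while the atomic variant has no reachable violation.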
Built without wrappers
No LangChain. No LlamaIndex. Raw embedding API calls, manual vector indexing, custom retrieval pipeline, hand-written prompt templates. Understanding the internals means debugging when things break — which they do, in production.
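A rough sketch of the retrieval step in plain Python, under stated assumptions: the vectors below are hand-made placeholders for what an embedding API would return, the document IDs are invented, and `top_k` and `build_prompt` are hypothetical helper names, not the project's actual functions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Brute-force search: score every document, keep the k best."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Hand-written template that grounds the answer in retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Placeholder vectors; real ones would come from an embedding API call.
index = {
    "restart-runbook": [0.9, 0.1, 0.0],
    "billing-faq": [0.0, 0.2, 0.9],
    "deploy-guide": [0.7, 0.6, 0.1],
}
hits = top_k([1.0, 0.2, 0.0], index, k=2)
prompt = build_prompt("How do I restart the service?", [doc_id for doc_id, _ in hits])
```

Brute-force scoring is fine for a small corpus; a larger index would need an approximate nearest-neighbor structure.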
Production relevance
Every cloud team is building RAG pipelines over internal runbooks, documentation, and knowledge bases. Amazon Q Business, Google Vertex AI Search, and Azure AI Search all use this pattern. Understanding it at the implementation level is increasingly required.