Mastering Chaos: A Netflix Guide to Microservices

Challenges and Solutions

A stateless service is not a cache or database
frequently accessed metadata
no instance affinity
loss of a node is a non-event
Autoscaling groups
- Compute efficiency
- Node failure
- Traffic spikes
- Performance bugs
The Chaos Monkey tool tests that when a node dies, the service continues to work
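A minimal sketch of that check, assuming a toy in-memory pool rather than real instances. The names `StatelessPool`, `kill_random_node`, `autoscale`, and `chaos_test` are hypothetical and are not Chaos Monkey's API; the point is only that with no instance affinity, any surviving node can serve the request while the autoscaling group replaces the lost one.

```python
import random


class StatelessPool:
    """A pool of interchangeable nodes behind a load balancer (hypothetical model)."""

    def __init__(self, node_ids, min_size):
        self.nodes = set(node_ids)
        self.min_size = min_size
        self._next_id = 0

    def kill_random_node(self, rng):
        """Chaos step: terminate one node chosen at random."""
        victim = rng.choice(sorted(self.nodes))
        self.nodes.remove(victim)
        return victim

    def autoscale(self):
        """Replace lost capacity until the pool is back at its minimum size."""
        while len(self.nodes) < self.min_size:
            self._next_id += 1
            self.nodes.add(f"replacement-{self._next_id}")

    def handle(self, request, rng):
        """Route to any live node; no instance affinity, so any node will do."""
        if not self.nodes:
            raise RuntimeError("no healthy nodes")
        node = rng.choice(sorted(self.nodes))
        return f"{request} served by {node}"


def chaos_test(pool, rng, rounds=5):
    """Kill a node each round and check that requests still succeed."""
    for i in range(rounds):
        pool.kill_random_node(rng)
        pool.handle(f"req-{i}", rng)  # raises if the service stopped working
        pool.autoscale()
    return len(pool.nodes)
```

Usage: `chaos_test(StatelessPool(["a", "b", "c"], min_size=3), random.Random(0))` runs five kill-and-recover rounds and returns the pool size afterwards.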

databases and caches
sometimes a custom app that holds large amounts of data (avoid keeping business logic and state in one application if you can)
loss of a node is a notable event
redundancy is fundamental
EVCache -> different nodes -> each node has multiple shard caches
separate out systems used for batch versus real time transactions
do request level caching
have an encrypted token with the data to fall back on should the service that updates the requested data be unavailable
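The last two points can be sketched together, under stated assumptions. `RequestContext`, `get_profile`, `make_token`, `read_token`, and `SECRET` are hypothetical names. The token here is only HMAC-signed, not encrypted, because the Python standard library has no authenticated encryption; a real token would be encrypted as the notes say.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # hypothetical key; signing stands in for encryption here


def make_token(data):
    """Pack data into a signed token the client carries with each request."""
    payload = base64.urlsafe_b64encode(json.dumps(data, sort_keys=True).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"


def read_token(token):
    """Return the data in the token, or None if the signature does not match."""
    payload, _, sig = token.partition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload.encode()))


class RequestContext:
    """Per-request state: a cache that lives only as long as one request."""

    def __init__(self, token, fetch):
        self.token = token
        self.fetch = fetch  # callable that calls the backing service
        self.cache = {}
        self.calls = 0      # how many times the backing service was tried

    def get_profile(self, user_id):
        if user_id in self.cache:  # request-level cache hit
            return self.cache[user_id]
        try:
            self.calls += 1
            value = self.fetch(user_id)
        except ConnectionError:
            value = read_token(self.token)  # possibly stale, but usable
        self.cache[user_id] = value
        return value
```

The design choice is that a repeated lookup within one request never reaches the service twice, and an outage degrades to the data in the token instead of failing the request.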

drift over time
- alert thresholds
- timeouts, retries, fallbacks
- throughput (RPS)
Across microservices
- Reliability best practices
Continuous learning and automation
- Incident --> Resolution --> Review --> Remediation --> Analysis --> Best Practices? --> Automation --> Adoption
Production-ready best practices
- Alerts
- Apache & Tomcat
- Automated canary analysis
- Autoscaling
- Chaos
- Consistent naming
- ELB Config
- Healthcheck
- Immutable machine images
- Squeeze testing
- Staged, red/black deployments
- Timeouts, retries, fallbacks
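
Timeouts, retries, and fallbacks appear in both lists above. A minimal sketch of how the three fit together, assuming a hypothetical helper `call_with_retries` and an `operation` callable that accepts a timeout in seconds and raises `TimeoutError` or `ConnectionError` on failure; the default values are illustrative, not recommendations from the talk.

```python
import time


def call_with_retries(operation, fallback, attempts=3, timeout_s=0.5, backoff_s=0.0):
    """Try operation(timeout_s) up to `attempts` times, then return fallback()."""
    for attempt in range(attempts):
        try:
            return operation(timeout_s)
        except (TimeoutError, ConnectionError):
            if attempt < attempts - 1:
                # Exponential backoff between attempts; no sleep after the last one.
                time.sleep(backoff_s * (2 ** attempt))
    return fallback()
```

Usage: `call_with_retries(fetch_recommendations, lambda: DEFAULT_LIST)`, where both names are placeholders for a service call and a static fallback value. Keeping these settings in one shared helper is one way to limit the drift over time noted above.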

The Paved Road (do this for a smooth experience)
- Stash
- Nebula/Gradle
- BaseAMI/Ubuntu
- Jenkins
- Spinnaker
- Runtime Platform
Cost of Variance
- Productivity Tooling
- Insight & Triage Capabilities
- Base Image Fragmentation
- Node management
- Library/Platform duplication
- Learning curve - production expertise
Strategic Stance
- Raise awareness of costs
- Constrain centralized support
- Prioritize by impact
- Seek reusable solutions