class: center, middle # Path to Resilient and Observable Microservices Slides: https://slides.peterj.dev .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- ## Safe Harbor .medium[ The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle's products may change and remains at the sole discretion of Oracle Corporation. Statements in this presentation relating to Oracle's future plans, expectations, beliefs, intentions and prospects are "forward-looking statements" and are subject to material risks and uncertainties. A detailed discussion of these factors and other risks that affect our business is contained in Oracle's Securities and Exchange Commission (SEC) filings, including our most recent reports on Form 10-K and Form 10-Q under the heading "Risk Factors." These filings are available on the SEC's website or on Oracle's website at http://www.oracle.com/investor. All information in this presentation is current as of September 2019 and Oracle undertakes no duty to update any statement in light of new information or future events.]
--- ## Introduction - I am Peter ([@pjausovec](https://twitter.com/pjausovec)) - Software Engineer at Oracle - Working on "cloud-native" stuff - Books: - [Cloud Native: Using Containers, Functions, and Data to Build Next-Gen Apps](https://www.amazon.com/Cloud-Native-Containers-Next-Generation-Applications/dp/1492053821) - SharePoint Development - VSTO For Dummies - Courses: - Kubernetes Course ([https://startkubernetes.com](https://startkubernetes.com)) - Istio Service Mesh Course ([https://learnistio.com](https://learnistio.com)) --- name:intro background-image: url(./images/microservices.jpeg) background-size: cover ??? What I am going to talk about today is not microservices. At least not directly. I am assuming you have probably heard a lot about microservices. Anyone in the room already implementing microservices? I am not going to spend too much time on the theory; I am just going to show you some things to keep in mind while implementing microservices, some tools you can use to make them more resilient, and how to test them. --- class: pic ![Monolith - 1](./images/monolith-1.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Assuming you are starting with a monolith - the first thing you will notice when you start breaking it apart... --- class: pic ![Monolith - 2](./images/monolith-2.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Is that now you have more things to manage and more things to think about. --- class: center, middle ## Microservices ### .highlight[Independently deployable] services, owning their data and having well-defined interfaces .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Quickly looking at the microservices definition - they are independently deployable services that are modelled around a business domain.
They own their own data and expose it through well-defined APIs. The really important thing to understand with microservices is the idea of independent deployments - this means that you can change a single microservice and deploy it into production without touching or affecting any other services. --- class: pic ![Monolith deploy](./images/monolith-deploy-1.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? If you think about a monolith, you would usually have a single code base and a single deployment. You would do your check-in, which kicked off a CI/CD process that perhaps executed tests and finally deployed your monolith to production. --- class: pic ![Microservices deploy](./images/microservices-deploy-1.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? With microservices you are doing multiple deployments and probably have multiple code bases. --- class: pic ![Microservices deploy - 2](./images/microservices-deploy-2.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Alternatively, you might have a single code base and multiple pipelines to be able to independently deploy all services. In addition to this, you are not running one instance of each component, you are running multiple instances of each service and so on. With so many instances and independent deployments, there are a lot of things that can go wrong - and not just with your deployment process, but also after your services have been deployed. --- name: resiliency class: middle, center # Service Resiliency .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? This is where resiliency comes into play.
--- class: middle, center # Resiliency ### Ability to .highlight[recover from failures] and .highlight[continue to function] .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? What is resiliency? It's not about avoiding failures, it's about responding to failures and recovering from them in such a way that there's no downtime or data loss. --- class: middle, center # Goal ### Return the service to a .highlight[fully functioning state] after failure .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? A simple goal of resiliency is to return the service to a fully functioning state after a failure occurs. --- ## Resiliency - High availability (1/2) - Healthy 🏥 - No significant downtime ⏱ - Responsive 🏎 - Meeting SLAs/SLOs 💰 .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? There are two aspects to service resiliency. One is high availability, which basically says that: - Service is running in a healthy state - There's no significant downtime - Service is responsive and users can interact with it - Service is meeting SLAs etc. --- ## Resiliency - Disaster Recovery (2/2) - How to recover from incidents - Data backup and archiving - DR starts when HA design can't handle the impact of failures .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? The second aspect is disaster recovery. Disaster recovery is all about how to recover from rare but major incidents (involves data backup, archiving, etc.)
We can also say that disaster recovery starts when the high availability design can't handle the impact of faults. --- ## Path to resiliency - Understand the requirements - Define service availability - Design for resiliency - Strategies for detection and recovery - Testing - Monitoring .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? There are a couple of steps to make your services more resilient. --- ## Path to resiliency **Understand the requirements** - How much downtime can you handle/is acceptable? - More downtime → broken SLAs/SLOs → .big[😢] **Define service availability** - What does it mean for a service to be available? **Design for resiliency** - Identify failure points - what can go wrong, and how? .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? 1. **Understand the requirements**: how much downtime can you handle or how much downtime is acceptable? Related to cost: more downtime -> broken SLAs/SLOs -> unhappy customers -> losing money 1. **Define service availability**: What does it mean for a service to be available? Is it that the website is up and running or is it also that you can submit orders/make purchases? 1. **Design for resiliency**: Identify failure points in your systems' architecture: what can go wrong and how can it go wrong --- ## Path to resiliency **Strategies for detection and recovery** - How are you detecting and recovering from failures? **Testing and monitoring** - Test for failure conditions so you can detect and recover from them - Monitor your services, so you know what's happening .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? 1. **Strategies for detection & recovery**: how are you going to detect and recover from failures?
- Ask questions such as: How will the service detect the failure, how will it respond to it, and how are you going to log and monitor this failure? For example: if the service is unavailable, you will return a 5xx HTTP status code; if the caller is not authenticated, you will return a 401, etc. 1. **Testing**: simulate failure conditions to ensure you are able to detect and recover from them 1. **Monitoring**: in order to know what's happening, you need to have monitoring in place --- ## Resiliency Strategies - Load Balancing - Timeouts and retries - Circuit breakers and bulkhead pattern - Data replication - Graceful degradation - Rate limiting .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Next I am going to go through some strategies that can help make your service more resilient. --- class: middle, center ## Load Balancing ### Scale services by .highlight[adding more instances] .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Load balancing: any service should be able to scale out by adding more instances. There are different load balancing algorithms that can help with achieving better resiliency: e.g. round robin, random, least connections. If you are using Kubernetes, for example, you don't have to worry about this --- ## Timeouts and Retries **Timeouts** - Network latency: how long do you wait for responses? - .highlight[Waiting indefinitely == bad] - Always define timeouts! **Retries** - Helps handle transient network failures - Only retry calls that make sense - Idempotent operations - Consider appropriate retry counts and intervals between retries .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Timeouts: - Crucial element in making services available; the network introduces a lot of unpredictable behavior, one of them being latency, so how long do you wait for the service response?
Is something wrong or is the service just slow? Do you just wait? Waiting == bad - there's usually a customer at the other end that's waiting for something; also waiting uses resources and might cause other services to wait, leading to potential cascading failures. Solution: Always have timeouts! Retries: Even with timeouts in place, the network could be experiencing transient or short-lived failures, and if we had retried one more time, the call might have succeeded. When implementing retries you need to make sure you are only retrying calls that make sense to be retried. For example, it doesn't make sense to retry the exact same call that failed with HTTP 403 - Forbidden. Similarly, it doesn't make sense to retry non-idempotent operations as that might cause unwanted side effects (e.g. same operation applied multiple times). You also need to think about how you are retrying the calls - for example, how many retries and what interval you wait between retries. Having an aggressive retry policy might degrade your system even further. If you have retried the calls multiple times and they still fail, it might make more sense to prevent further requests right away and return a failure - and this is where the next pattern comes in. --- class: center,middle ## Circuit breaker ### Prevents doing an operation that is likely to fail .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Retries can help you with overcoming short-lived network failures; however, there are cases where failures might take longer to fix. In that case it makes sense to use a circuit breaker. The idea behind a circuit breaker is to prevent doing an operation that is likely to fail. So instead of making another retry, the circuit breaker fails fast and returns a failure right away. After a preset time, the circuit breaker resets itself and the calls are going to be made to the service again.
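To make the mechanics concrete, here is a minimal, illustrative circuit breaker sketch in Python. The class name, thresholds, and error type are made up for this example; real implementations (e.g. Envoy, resilience4j) add a proper half-open state, thread safety, and configurable error classification.

```python
import time

class CircuitBreaker:
    """Minimal illustrative circuit breaker: fail fast after repeated errors."""

    def __init__(self, failure_threshold=10, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds before allowing a trial call
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, operation):
        # While open, fail fast until the reset timeout has elapsed
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open - failing fast")
            self.opened_at = None                   # timeout elapsed: let a call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                           # a success resets the failure count
        return result
```

The key property is that once the breaker is open, the wrapped operation is not invoked at all - callers get an immediate failure instead of tying up resources on a call that is likely to fail.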
--- class: pic ![Circuit breaker - 1](./images/cb-1.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Let's say we have two services - everything is good, the circuit breaker is closed and the calls are going through. --- class: pic ![Circuit breaker - 2](./images/cb-2.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? For whatever reason, the calls start to fail. The retries kick in and the operations are retried, however those are failing as well. After we have seen the same call fail 10 times, for example, it's time to trip the circuit breaker. --- class: pic ![Circuit breaker - 3](./images/cb-3.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? The circuit breaker has tripped and it's not letting any more calls through to the service. Instead, it returns an error right away. --- class: pic ![Circuit breaker - 4](./images/cb-4.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Finally, after some pre-defined time, the circuit breaker will reset itself and it will start letting the calls go through again. If they start failing again and the failure threshold is reached again, the circuit breaker will trip again. --- class: center, middle ## Bulkhead pattern ### Isolate resources in such a way that if one fails, it's not affecting others .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Bulkhead pattern: isolate critical resources, so failures in one subsystem can't cascade and cause failures in other parts of the application. If using Kubernetes, you should, for example, configure memory and CPU limits for your pods, so one service can't eat up all the resources.
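Outside of Kubernetes resource limits, the bulkhead idea is often implemented as bounded concurrency per dependency. Here is an illustrative sketch, assuming a threaded service; the class name and limit are made up for this example:

```python
import threading

class Bulkhead:
    """Illustrative bulkhead: cap concurrent calls to one dependency."""

    def __init__(self, max_concurrent=5):
        # One bounded pool of "slots" per downstream dependency
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Reject immediately instead of queueing when the pool is exhausted,
        # so a slow dependency can't tie up every worker thread.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full - rejecting call")
        try:
            return operation()
        finally:
            self._slots.release()
```

With one bulkhead per dependency, a hang in one downstream service exhausts only that dependency's slots, while calls to healthy dependencies keep flowing.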
--- class: middle, center ## Data replication ### Handle non-transient failures in the data store .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? A general strategy for handling non-transient failures in the data store; many storage technologies have replication built in. --- class: center,middle ## Graceful degradation ### Ability to .highlight[maintain limited functionality] in face of failures .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? If one service fails, the whole system should continue working; for example, you can provide an acceptable user experience even without all services up and running. Examples: - traffic lights - when something is wrong, the traffic lights don't go dark, the yellow light blinks --- class:middle, center ## Rate limiting ### Restrict the number of requests made in a period of time .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? The idea is to restrict the number of requests a single client can make in a period of time. This is a way to prevent users from creating excessive and unnecessary load on the system. --- class: middle,center # How? .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? The question is - how do you do all this and implement it in the real world? I'll talk about how service mesh can help you with this, but it's definitely not the only way to do this.
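For example, one common way to hand-roll the rate limiting just described is a token bucket: it allows short bursts while enforcing an average rate per client. This is an illustrative sketch (class name and numbers are made up), not what a mesh does internally:

```python
import time

class TokenBucket:
    """Illustrative token-bucket rate limiter: allow short bursts while
    enforcing an average request rate for a single client."""

    def __init__(self, rate=10.0, capacity=20):
        self.rate = rate                 # tokens added per second
        self.capacity = capacity         # maximum burst size
        self.tokens = float(capacity)    # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                     # caller would typically return HTTP 429
```

A server would keep one bucket per client (e.g. per API key) and reject requests when `allow()` returns False.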
--- name: service-mesh-overview class: center, middle # Service Mesh .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- class: center, middle ### Dedicated infrastructure layer to .highlight[connect], .highlight[manage], and .highlight[secure] workloads by managing the communication between them .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Enter service mesh – next to microservices, you might have heard people talk about service mesh as well; I'll just briefly explain what this is. The definition says that a service mesh is a dedicated infrastructure layer for managing service-to-service communication to make it manageable, visible and controlled – so it's all about communication between services --- ## Istio service mesh - Open source service mesh - Google, IBM, Lyft - Well-defined API - Can be deployed on-premise, in the cloud - Kubernetes - Mesos .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? I'll be talking about the Istio service mesh. Istio is an open source service mesh that is backed by Google, IBM and Lyft with contributions from other companies as well. It has a well-defined API and you can think of it as a platform, as you can build your own customizations on top of it. It can also run either on-premise or in the cloud - you can deploy it on VMs or with orchestration platforms such as Kubernetes and Mesos. All demos that I'll show you today will be running on an Istio service mesh deployed on a Kubernetes cluster. --- class: pic ![Service Mesh - Code outside of the service](./images/sidecar.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? So how does the service mesh work?
The Istio service mesh injects an Envoy proxy (an edge and service proxy created by Lyft and designed for cloud-native applications) next to every service you are running inside Kubernetes. This proxy lives as a second container in the same Kubernetes pod your service is in. One of the properties of a Kubernetes pod is that all containers within a single pod share the network, and this means the proxy can use localhost to access the service. Any requests coming to the service or going out of the service are intercepted by this proxy. --- class: center, middle ![A calling B - 1](./images/a-calling-b-1.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? So if service A calls service B, the request goes through proxy A --- class: center, middle ![A calling B - 2](./images/a-calling-b-2.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? The proxy on the service A side calls proxy B --- class: center, middle ![A calling B - 3](./images/a-calling-b-3.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? And finally, the proxy next to the service B instance forwards the request to service B. --- background-image: url(./images/sidecar-dog.jpg) background-size: cover .copyright[Source: https://barkpost.com/cute/sidecar-dogs/] ??? This type of proxy deployment is called a **sidecar** deployment, and it allows Istio to extract information about traffic. This information is then used by the service mesh to enforce policy decisions and send the information to monitoring.
--- class: center,middle name:mesh-resiliency # Service Mesh & Resiliency .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- ## Resiliency Strategies - Service Mesh - .highlight[Load Balancing] - .highlight[Timeouts and retries] - .highlight[Circuit breakers] and .disabled[bulkhead pattern] - .disabled[Data replication] - .disabled[Graceful degradation] - .highlight[Rate limiting] .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? If we go back to the list of strategies – this is how a service mesh can help us; it can do the load balancing for us and it allows us to implement timeouts, retries, circuit breakers and rate limiting. There are also other things a service mesh can do, such as traffic management and security, but those are out of scope for this talk --- ## Testing for resiliency - Test - Measure - Analyze (fix the issues) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Let's talk about how you test for service resiliency. It's not the same as running unit tests and testing for functionality. You need to test how your services perform under failure conditions. At a very high level, there are 3 steps when you're doing testing: you conduct the test, you measure the behavior, and you analyze and fix the issues. Then you repeat the whole cycle again and again. --- ## Testing for Resiliency - Service Mesh ### Inject failures - .highlight[Delays] - Example: "For 30% of the requests, wait 5 seconds before responding" - .highlight[Faults] - Example: "For 50% of the requests, return HTTP 404" ???
A service mesh has the ability to inject failures and delays. For example: you can say that X% of the requests made to the service return HTTP code 404, or that X% of the requests take an extra 5 seconds --- ### Failure injection ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: service-b spec: hosts: - service-b.default.svc.cluster.local http: - route: - destination: host: service-b.default.svc.cluster.local subset: v1 * fault: * abort: * percent: 60 * httpStatus: 404 ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- ### Delay injection ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: service-b spec: hosts: - service-b.default.svc.cluster.local http: - route: - destination: host: service-b.default.svc.cluster.local subset: v1 * fault: * delay: * percent: 20 * fixedDelay: 3s ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- ### Timeouts & Retries ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: service-b spec: hosts: - service-b.default.svc.cluster.local http: - route: - destination: host: service-b.default.svc.cluster.local subset: v1 * timeout: 5s * retries: * attempts: 6 * perTryTimeout: 2s * retryOn: gateway-error,connect-failure ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? By default, timeouts for HTTP requests in Istio are disabled, but we can set or enable them by adding a timeout field to the VirtualService.
You can also dynamically override the timeout from your service on a per-request basis using Envoy's `x-envoy-upstream-rq-timeout-ms` request header. --- ### Circuit Breakers ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: DestinationRule metadata: name: service-b spec: host: service-b.default.svc.cluster.local * trafficPolicy: * connectionPool: * tcp: * maxConnections: 1 * http: * http1MaxPendingRequests: 1 * maxRequestsPerConnection: 1 * outlierDetection: * consecutiveErrors: 1 * interval: 1s * baseEjectionTime: 3m * maxEjectionPercent: 100 ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Circuit breaker settings can be applied on the destination rule for the service. Compared to retries and timeouts, there are way more knobs you can adjust. In these settings you can define the thresholds that will trip the circuit breaker, as well as the percentage of instances you want to eject from the load balancing pool when they start failing and for how long you want to eject them. The first thing that needs to be defined is the connection pool settings. You can apply them both at the TCP level and at the HTTP level. For TCP, the following fields can be set: max connections Connection timeout TCP keep alive settings (number of probes to send before deciding connection is dead, time and interval) For HTTP: http1MaxPendingRequests/http2MaxRequests -> maximum number of requests (default is 1024) maxRequestsPerConnection -> max number of requests per connection maxRetries -> max number of retries that can be outstanding idleTimeout -> period in which there are no active requests. When the idle timeout is reached, the connection is closed With these values set, you are limiting the connection pool and the requests that will be made to the service; if we go over these values, the circuit breaker trips. For example, if we made more than 1 connection to the service, the circuit breaker would short-circuit the calls that exceed the configured settings.
What if nodes go down in the cluster? Or some of our pods keep restarting? This is where the outlier detection comes into play. The outlier detection can detect when hosts are not reliable and can eject them for a period of time. So if we think about a Kubernetes service that load balances between 10 pods – if some of those pods are unhealthy and the outlier detection detects that, it will exclude or eject those pods from the load balancing pool. Let's look at the settings: consecutiveErrors: number of errors before a host is ejected from the pool (default value is 5) – 502, 503 or 504 qualify as errors; for TCP, timeouts and connection errors qualify as errors Interval: period between host scanning; for example, the 10 second default value means that the upstream hosts will be scanned every 10 seconds and analyzed for failures baseEjectionTime: this value represents the minimum ejection duration (default is 30 seconds) maxEjectionPercent: max % of hosts in the pool that can be ejected – default is 10% -- this means if more than 10% of the hosts are failing, only 10% will be ejected. minHealthPercent: with this value you can control whether outlier detection is enabled or not. For example: if you set the minHealthPercent to 50% and the number of healthy hosts drops below 50%, the outlier detection is disabled and in that case the proxy will load balance across all hosts in the pool, regardless of whether they are healthy or not --- class: center,middle name:observability # Observability .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- class:center, middle ## What is observability? ### Act of .highlight[measuring], .highlight[collecting] and .highlight[analyzing] metrics, traces, logs, events, … from services .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ???
Let's talk about observability; observability is the act of measuring, collecting and analyzing metrics, traces, logs, events and any other signals from your services. It is crucial for resiliency: you need to know when something failed and have enough information to figure out what caused the failure; it allows you to investigate issues. There are multiple parts to the observability pipeline: Your services need to emit data (metrics, logs, etc.) The metrics have to be collected and stored somewhere centrally Finally, once data is collected and consolidated, you can analyze it (e.g. search for errors in the logs), visualize different metrics (Grafana) and potentially alert on certain conditions (PagerDuty alerts on dropped availability, errors, etc.) As part of observability, we can talk about 3 patterns: logging, monitoring, alerting --- ## Logging - Log granularity: verbose, debug, warning, info, error, ... - Storage is cheap - rather log more than less - Store logs in a central place - Use correlation IDs - Don't log private information - Use a common format ??? If there's one thing we can all probably agree on, it is that applications and services will fail, and they will behave in unexpected ways when deployed to production. When that happens it is important to know what was happening with your services at the time the errors happened. The way you can do that is to write logs from your services. Good logging is crucial to be able to reproduce the errors your end-users are experiencing. The first thing to consider is how granular your log messages will be - ideally this should be configurable, so if needed you can switch on more detailed logging. Logs from all your services and components should also go to a central place where you can search through all the logs if needed. With all logs in one place it is also important to include a unique correlation ID with every request.
This will allow you to stitch together logs from different services and follow them throughout your system. --- ## Monitoring & Alerting **Monitoring** - Know the state of your system - Collect system/infrastructure metrics (e.g. CPU, memory, ...) and business metrics (instrument services) **Alerting** - Queries that look for failures/failure conditions - PagerDuty -> alert on-call engineers - Only fire alerts when something is really wrong ??? The idea behind monitoring is that you know what's happening with your system. You want to have your monitoring set up in such a way that you know right away when something goes wrong or even before something goes wrong. You don't want your users to tell you that your application is not working. Monitoring involves collecting system and infrastructure metrics such as CPU, memory, disk space etc. Additionally, most monitoring tools allow you to collect traces from your services as well - this allows you to instrument your services, basically to tell them when to emit a certain trace, and then you can visualize this in your monitoring tool. For example, you can emit an event each time someone purchases something or someone signs up and so on. Just monitoring these things is useful; however, you also want to be notified or alerted if or when something goes wrong. This is where alerting comes in. Alerting depends on logging and monitoring, and it allows you to create queries that look for failures. For example, you can create an alert that will fire if there are errors in your logs or if the memory consumption of your nodes is more than 90%, and similar. These alerts can then hook into tools such as PagerDuty to notify the on-call engineer when something goes wrong. Finally, you only want to fire alerts when something is really a problem. You probably don't want to have someone look at the alerts if there's a single failure in the log, as people will start ignoring the alerts.
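As a sketch of the correlation ID idea from the logging slide: each request either reuses the caller's ID or mints a new one, and every log line carries it. The service and header names here are illustrative, though `X-Correlation-ID` is a common convention:

```python
import logging
import uuid

# Include a per-request correlation ID in every log line so logs from
# different services can be stitched together in central log storage.
logging.basicConfig(format="%(asctime)s %(levelname)s [%(correlation_id)s] %(message)s")
logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)

def handle_request(headers):
    # Reuse the caller's ID when present, otherwise start a new trace
    correlation_id = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    log = logging.LoggerAdapter(logger, {"correlation_id": correlation_id})
    log.info("processing order")
    # Propagate the same ID on any outgoing calls to downstream services
    return {"X-Correlation-ID": correlation_id}
```

The important part is propagation: the same ID must travel with the request across every hop, or the logs can't be joined back together.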
--- ## Tools - Grafana - Jaeger - Kiali - EFK (Elasticsearch + Fluentd + Kibana) - PagerDuty ??? --- class:pic ![Weather Demo](./images/weather-demo.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- name: resources # Resources - Kubernetes on Oracle Cloud (OKE) - (https://cloud.oracle.com) - Kubernetes - (https://kubernetes.io) - Istio - (https://istio.io) **Oracle Microservices Example** - MuShop - https://github.com/oracle-quickstart/oci-cloudnative .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- name: thankyou ## Thank you **Slides**: https://slides.peterj.dev **Contact** - [@pjausovec](https://twitter.com/pjausovec) - [https://peterj.dev](https://peterj.dev) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- name:toc ### Table of Contents .small[ - [Introduction](#intro) - [Service Resiliency](#resiliency) - [Service Mesh](#service-mesh-overview) - [Service Mesh & Resiliency](#mesh-resiliency) - [Resources](#resources) ] .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)]