class: center, middle # Istio Service Mesh Patterns Slides: https://slides.peterj.dev/oracle-doag-2019 .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- ## Safe Harbor .medium[ The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle's products may change and remains at the sole discretion of Oracle Corporation. Statements in this presentation relating to Oracle's future plans, expectations, beliefs, intentions and prospects are "forward-looking statements" and are subject to material risks and uncertainties. A detailed discussion of these factors and other risks that affect our business is contained in Oracle's Securities and Exchange Commission (SEC) filings, including our most recent reports on Form 10-K and Form 10-Q under the heading "Risk Factors." These filings are available on the SEC's website or on Oracle's website at http://www.oracle.com/investor. All information in this presentation is current as of September 2019 and Oracle undertakes no duty to update any statement in light of new information or future events.] 
--- ## Introduction - I am Peter ([@pjausovec](https://twitter.com/pjausovec)) - Software Engineer at Oracle - Working on "cloud-native" stuff - Books: - [Cloud Native: Using Containers, Functions, and Data to Build Next-Gen Apps](https://www.amazon.com/Cloud-Native-Containers-Next-Generation-Applications/dp/1492053821) - SharePoint Development - VSTO For Dummies - Courses: - Kubernetes Course ([https://startkubernetes.com](https://startkubernetes.com)) - Istio Service Mesh Course ([https://learnistio.com](https://learnistio.com)) --- name: service-mesh-overview class: center, middle # Service Mesh .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- class: center, middle ### Dedicated infrastructure layer to .highlight[connect], .highlight[manage], and .highlight[secure] workloads by managing the communication between them .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? - service mesh is all about communication between services --- ## Istio service mesh - Open source service mesh - Google, IBM, Lyft - Well-defined API - Can be deployed on-premises or in the cloud - Kubernetes - Mesos .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? I'll be talking about the Istio service mesh. Istio is an open source service mesh that is backed by Google, IBM and Lyft, with contributions from other companies as well. It has a well-defined API and you can think of it as a platform, as you can build your own customizations on top of it. It can also run either on-premises or in the cloud - you can deploy it on VMs or with orchestration platforms such as Kubernetes and Mesos. All demos that I'll show you today will be running on an Istio service mesh deployed on a Kubernetes cluster. 
--- class: pic ![Service Mesh - Code outside of the service](./images/sidecar.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? So how does the service mesh work? Istio injects an Envoy proxy (an edge and service proxy created by Lyft and designed for cloud-native applications) next to every service you are running inside Kubernetes. This proxy lives as a second container in the same Kubernetes pod your service is in. One of the properties of a Kubernetes pod is that all containers within a single pod share the network, and this means the proxy can use localhost to access the service. Any requests coming to the service or going out of the service are intercepted by this proxy. --- class: center, middle ![A calling B - 1](./images/a-calling-b-1.png) ??? So if service A calls service B, the request goes through proxy A --- class: center, middle ![A calling B - 2](./images/a-calling-b-2.png) ??? The proxy on the service A side calls proxy B --- class: center, middle ![A calling B - 3](./images/a-calling-b-3.png) ??? And finally the proxy next to the service B instance forwards the request to service B. --- background-image: url(./images/sidecar-dog.jpg) background-size: cover .copyright[Source: https://barkpost.com/cute/sidecar-dogs/] ??? This type of proxy deployment is called a **sidecar** deployment and it allows Istio to extract information about traffic. This information is then used by the service mesh to enforce policy decisions and send the information to monitoring. 
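---
## Enabling Sidecar Injection

The sidecar usually gets added through Istio's automatic injection: labeling a namespace is enough for the injection webhook to add the Envoy container to every new pod in it. A minimal sketch, assuming you want the `default` namespace in the mesh:

```yaml
# Labeling a namespace opts all of its new pods into
# automatic Envoy sidecar injection.
apiVersion: v1
kind: Namespace
metadata:
  name: default              # namespace you want inside the mesh
  labels:
    istio-injection: enabled
```

Pods that already exist keep running without the proxy until they are recreated.

.oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)]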
--- ## Service Mesh - Architecture **Data plane (proxies)** - Run next to each service instance (or one per host) - Istio uses Envoy proxy - Intercept all incoming/outgoing requests (`iptables`) - Configured on how to handle traffic - Emit metrics **Control plane** - Validates rules - Translates high-level rules to proxy configuration - Updates the proxies/configuration - Collects metrics from proxies .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? At a high level, any service mesh has at least two parts: the data plane and the control plane. The data plane is the collection of all proxies that run next to each service. These proxies intercept any traffic coming to the service or leaving the service. They are driven by a configuration and have the ability to automatically reload the configuration when it changes. Additionally, they can also emit metrics about the traffic that passes through them. The job of a control plane is, as the name suggests, to 'control' the data plane or the proxies in the data plane. The control plane is usually the entry point or the endpoint that developers can use to control the service mesh. The components inside the control plane take the high-level rules (traffic rules, configuration, or policies), validate them, translate them to the proxy-specific configuration, and push them to all the proxies. There are other components that deal with metrics or security that are usually run as part of the control plane as well. Some people also talk about a management plane whose purpose would be to manage multiple meshes across multiple clusters. --- ## Service Mesh - Features **Connect** - Layer 7 routing and traffic management - %-based traffic split (URIs, headers, scheme, method, ...) 
- Circuit breakers, timeouts and retries **Manage** - Telemetry (proxies collect metrics automatically -> tools: Grafana, Jaeger, Kiali) - Visibility into service communication without code changes **Secure** - Secure communication between services (mutual TLS) - Identity + cert for each service .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Layer 7 routing and traffic management. This allows you to make decisions based on the content of the message or request - URL, cookies, headers, method, etc. So you can say, for example, if the request contains a specific header with a specific value, then I want that request to go to the latest version of the service. Or you can also say, I want to split 50% of the traffic to the previous version of the service and 50% of the traffic to the latest version of the service. In addition to the traffic routing, you can also define circuit breakers and request timeouts and retries. Since every request passes through the proxy, all the telemetry gets collected for free. These metrics and telemetry can give you visibility into the communication that happens between your services. Istio also comes with preinstalled and pre-configured tools like Grafana, Jaeger and Kiali. Finally, security. Istio generates an identity and certificate for each service and you can use that to secure the communication between services using mutual TLS. Additionally, you can also control which services are allowed to communicate. 
--- class: pic ![Istio Architecture](./images/istio-arch.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- name: traffic-management class: center, middle # Traffic Management .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- class: pic ![Istio Pilot](./images/sm-pilot.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Here are some of the things we need to know: Each pod has an injected Envoy proxy next to it The Envoy proxy intercepts and forwards all traffic between the caller and the service All rules come from the Pilot and are pushed to the Envoy proxies, and the proxies reconfigure themselves automatically We write the rules using YAML and send them to the cluster --- class: pic ![Istio Pilot](./images/sm-pilot-1.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Pilot maintains a platform independent abstract model of services in the mesh. It uses platform adapters to translate and populate its model. For example, the Kubernetes platform adapter monitors pods, ingresses and other traffic related resources that store traffic management rules and translates that data into an abstract representation. The Envoy API then uses this abstract model to create Envoy specific configuration that gets pushed to all the proxies running in the mesh. Upon a change in their configuration, the proxies reconfigure themselves automatically. --- class: pic ![Traffic Split](./images/sm-traffic-split.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Using this design, the callers of the service have no clue about different versions of the destination service. Service A does not know there are two versions of service B. 
it just calls Service B and the proxies decide (based on the routing rules) how and where to route the traffic. --- ## Service Mesh - Istio **Traffic Management Resources** - Gateway - VirtualService - DestinationRule - ServiceEntry - Sidecar .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? There are five Istio resources involved in traffic management. Using these rules you can tell Istio how to deal with the traffic. --- ## Service Mesh - Virtual Service ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: serviceb-vs spec: hosts: - service-b.default.svc.cluster.local http: - route: - destination: * host: service-b.default.svc.cluster.local * subset: v1 * weight: 98 - destination: host: service-b.default.svc.cluster.local subset: v2 weight: 2 ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? A virtual service defines the rules that control how requests for a service are routed within an Istio service mesh. You can use a VirtualService to route requests to different versions of a service or to a completely different service. You can do the routing based on the request source and destination, HTTP paths and header fields, and weights associated with individual service versions. --- ## Service Mesh - Destination Rule ```yaml apiVersion: networking.istio.io/v1alpha3 kind: DestinationRule metadata: name: serviceb-dr spec: host: service-b.default.svc.cluster.local * subsets: - name: v1 labels: version: v1 - name: v2 labels: version: v2 * trafficPolicy: tls: mode: ISTIO_MUTUAL ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? A destination rule configures a set of policies that get applied to a request after the virtual service does the routing. For example, we can apply the TLS policy as well as define the subsets. 
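---
## Example: Header-Based Routing

Routing does not have to be weight-based. As a hedged sketch (the `x-beta-tester` header is made up for illustration), a VirtualService can match on a request header and send only matching requests to the v2 subset, with everything else falling through to v1:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: serviceb-vs
spec:
  hosts:
    - service-b.default.svc.cluster.local
  http:
    - match:
        - headers:
            x-beta-tester:   # hypothetical header; any header works
              exact: "true"
      route:
        - destination:
            host: service-b.default.svc.cluster.local
            subset: v2
    - route:                 # default route for all other requests
        - destination:
            host: service-b.default.svc.cluster.local
            subset: v1
```

The subsets referenced here are the ones defined in the DestinationRule.

.oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)]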
--- class: pic ![Subsets - 1](./images/sm-subsets-1.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Let’s look at an example. We have 4 pods labelled app=hello and version=v1 and version=v2. These pods are created by their own deployments: a v1 deployment and a v2 deployment. Lastly, we have a service-b that uses ‘app=hello’ as a selector. --- class: pic ![Subsets - 2](./images/sm-subsets-2.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? If we make a request to service-b, we will either get a response from a v1 pod or from a v2 pod, because the app=hello selector returns all pods. --- ## Destination rule ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: DestinationRule metadata: name: serviceb-dr spec: host: service-b.default.svc.cluster.local subsets: * - name: v1 * labels: * version: v1 * - name: v2 * labels: * version: v2 ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? With the subsets we can say that we want to apply additional labels to the service selector, depending on which subset we are routing to. We can define subsets called v1 and v2 that apply an additional label (e.g. version=v1) when traffic hits the service. --- ## Virtual service ```line-numbers ... http: - route: - destination: * host: service-b.default.svc.cluster.local * subset: v1 * weight: 30 ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Once the traffic gets routed to the destination that we defined in the virtual service, we also specify the subset of the service we want to route the traffic to. --- class:pic ![Subsets - 3](./images/sm-subsets-3.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? 
So, the routing occurs in the VirtualService: we want to route the traffic to service-b and subset v1; this means that in addition to applying the app=hello label, the version=v1 label is going to be applied as well and we will only get responses back from the pods labelled with version v1. --- class:pic ![Subsets - 4](./images/sm-subsets-4.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Similarly, when the traffic gets routed to subset v2, the additional label will get applied and the traffic will get routed to the pods that are labelled with v2. --- class:pic ![Subsets - 5](./images/sm-subsets-5.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? --- ## Service Mesh - Service Entry ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: ServiceEntry metadata: name: movie-db spec: hosts: * - api.themoviedb.org ports: * - number: 443 name: https protocol: HTTPS * resolution: DNS * location: MESH_EXTERNAL ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? A service entry is used to enable requests to services outside of the mesh. --- ## Service Mesh - Gateway ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: Gateway metadata: name: gateway spec: selector: * istio: ingressgateway servers: - port: * number: 80 name: http protocol: HTTP hosts: * - "hello.example.com" ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? A Gateway is used for defining a load balancer that operates at the edge of the mesh for ingress and egress traffic. You can use this resource to expose the services to the outside world. You can also attach the gateways to a VirtualService (if you want to make it publicly accessible). 
Istio also installs an ingress and egress gateway that acts as a controller for the gateway resource (similar to the controller for the ingress resource in Kubernetes) --- class:pic ![Gateway - 1](./images/gateway-1.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Let's look at an example. We have a Kubernetes cluster with a bunch of services running inside and we want to make `service-a` accessible from the public internet. --- class:pic ![Gateway - 2](./images/gateway-2.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? As part of Istio, an ingress gateway gets installed (Envoy proxy) and it is exposed on an external IP address. If you are using a cloud-hosted Kubernetes cluster this will be an actual load balancer that gets provisioned at the cloud provider and it gives you an external IP address you can use to access services in your cluster. --- class:pic ![Gateway - 3](./images/gateway-3.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Once you have the external IP address you can create a Gateway resource where you define the port and the hosts you want the gateway to be accessible on. Practically speaking, the host is the domain or subdomain name that you want to use and you register at your domain registrar. Once that's set up, you can go to the DNS settings on your domain registrar and enter that external IP address as an A record and set up a CNAME record for your domain as well. With this you are saying that whenever someone types in `www.service-a.com` you want it to resolve to that IP address. --- class:pic ![Gateway - 4](./images/gateway-4.png) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Finally, you need to tell the ingress gateway which service inside Kubernetes you want to route the traffic to. 
Otherwise the traffic will just go nowhere. The way you set that up is by attaching the gateway to the virtual service using the gateways field. Once a virtual service has an attached gateway, it will make that service accessible to the outside. --- ## Service Mesh - Sidecar ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: Sidecar metadata: name: default namespace: prod-us-west-1 spec: * egress: * - hosts: * - 'prod-us-west-1/*' * - 'prod-apis/*' * - 'istio-system/*' ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? The Sidecar resource is a recent addition to Istio. It is used to control the configuration of the sidecar proxy – specifically the inbound and outbound communication. By default, the proxies in the mesh are configured in such a way that they accept any traffic and can be reached from any service on all ports. With the sidecar resource, you can control and fine-tune the set of ports and protocols that the proxy accepts when it forwards traffic to your service. Additionally, you can also restrict the set of services the proxy can reach. For example, with the YAML snippet above we are declaring a sidecar resource in the prod-us-west-1 namespace that configures all Envoy proxies in that namespace and allows egress (outgoing) traffic to services running in the prod-us-west-1, prod-apis and istio-system namespaces. Similarly, you could also define the ingress portion and control which traffic can get into the services. --- ## Service Mesh - Traffic Management - Define subsets in DestinationRule - Define route rules in VirtualService - Define one or more destinations with weights .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? So how can we split the traffic to multiple versions of the same service? As a first step, we need to determine the subsets or the way we want to split the traffic. Are we splitting it based on the version? 
Or based on the environment or any other label or key/value pair? Once we have the subsets defined, we can create a route rule in the virtual service. Under each route rule, we can have one or more destinations that the traffic gets routed to. These destinations will use the subsets we defined in the destination rule. Additionally, we need to set the weights on all destinations to define how the traffic should be split. --- class: center, middle # Demo ### Istio Traffic Routing .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- name: resiliency class: middle, center # Service Resiliency .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- class: middle, center # Resiliency ### Ability to .highlight[recover from failures] and .highlight[continue to function] .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? What is resiliency? It’s not about avoiding failures, it’s about responding to failures in such a way that there’s no downtime or data loss. --- class: middle, center ### Return the service to a .highlight[fully functioning state] after failure .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? A simple goal of resiliency is to return the service to a fully functioning state after a failure occurs --- ## Resiliency **High availability** - Healthy - No significant downtime - Responsive - Meeting SLAs **Disaster recovery** - Design can't handle the impact of failures - Data backup & archiving .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? There are two aspects to service resiliency. One is high availability, which basically says that: The service is running in a healthy state There’s no significant downtime The service is responsive and users can interact with it The service is meeting SLAs etc. 
The second aspect is disaster recovery. Disaster recovery is all about how to recover from rare but major incidents (involves data backup, archiving, etc.) We can also say that disaster recovery starts when the high availability design can’t handle the impact of faults --- ## Resiliency Strategies - Load Balancing - Timeouts and retries - Circuit breakers and bulkhead pattern - Data replication - Graceful degradation - Rate limiting .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Next I am going to talk about some strategies that can help you with service resiliency Load balancing: any service should be able to scale out by adding more instances. There are different load balancing algorithms that can help with achieving better resiliency: e.g. round robin, random, least conn Timeouts & Retries: - Crucial element in making services available; the network introduces a lot of unpredictable behavior, one of them being latency, so how long do you wait for the service response? Is something wrong or is the service just slow? Do you just wait? Waiting == bad - there’s usually a customer at the other end that’s waiting for something; also waiting uses resources and might cause other services to wait, leading to potential cascading failures Solution: Always have timeouts! Retries: - Even with timeouts in place, the network could be experiencing transient failures and if only we retried one more time, the call might have succeeded. Guideline: make at least a couple of attempts before calling it quits and giving up on the service Some examples of retries: hardcoded delay between each retry, exponential back-off Circuit breakers & bulkhead pattern - we mentioned the cascading failures that can occur in systems; the circuit breaker pattern is what you can use to prevent additional strain on the system and prevent cascading failures. The way a circuit breaker works is that it encapsulates the functionality or services that need to be protected. 
You need to configure a threshold in your circuit breaker that will cause it to trip. For example: 10 consecutive failures in 5 seconds, or more than 2 connections, etc. Once those values are exceeded, the circuit breaker trips and removes the failing service from the load balancing pool for a predefined amount of time (e.g. a minute, 5 minutes, ...). Bulkhead pattern: isolate critical resources, so failures in one subsystem can’t cascade and cause failures in other parts of the application Data replication: - General strategy for handling non-transient failures in a data store; many storage technologies have replication built in Degrade gracefully: - If one service fails, the whole system should continue working; for example, you can provide an acceptable user experience even without all services up and running Use cached data, put items on the queue, show an error message, etc. Rate limiting: - Make sure you are rate limiting the requests, so no single user can create excessive load --- ## Service Mesh - Timeouts ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: service-b spec: hosts: - service-b.default.svc.cluster.local http: - route: - destination: host: service-b.default.svc.cluster.local subset: v1 * timeout: 5s ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? By default, any timeouts for HTTP requests in Istio are disabled, but we can set or enable them by adding a timeout field to the Virtual Service. 
You can also dynamically overwrite the timeout from your service using a request header value --- ## Service Mesh - Retries ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: service-b spec: hosts: - service-b.default.svc.cluster.local http: - route: - destination: host: service-b.default.svc.cluster.local subset: v1 * retries: * attempts: 3 * perTryTimeout: 3s * retryOn: gateway-error,connect-failure ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Similarly, we can define the retries in the virtual service. We define the number of retries and the timeout for each retry; this is so you don’t put unnecessary strain on the service and to minimize the impact on an overloaded service You can also configure the types of HTTP codes to retry on, for example gateway error, connection failure, reset, 500. Additionally, you can also define the status codes on which to retry using a header. --- ## Service Mesh - Circuit Breakers ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: DestinationRule metadata: name: service-b spec: host: service-b.default.svc.cluster.local * trafficPolicy: * tcp: * maxConnections: 1 * http: * http1MaxPendingRequests: 1 * maxRequestsPerConnection: 1 * outlierDetection: * consecutiveErrors: 1 * interval: 1s * baseEjectionTime: 3m * maxEjectionPercent: 100 ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Circuit breaker settings can be applied on the destination rule for the service. Compared to retries and timeouts there are way more knobs you can adjust. The first thing that needs to be defined is the connection pool settings. You can apply them both at the TCP level and at the HTTP level. 
For TCP, the following fields can be set: max connections Connection timeout TCP keep alive settings (number of probes to send before deciding connection is dead, time and interval) For HTTP: http1MaxPendingRequests/http2MaxRequests -> maximum number of requests (Default is 1024) maxRequestsPerConnection -> max number of requests per connection maxRetries -> max number of retries that can be outstanding idleTimeout -> period in which there are no active requests. When idle timeout is reached, the connection is closed With these values set, you are limiting the connection pool and requests that will be made to the service; if we go over these values, the circuit breaker trips. For example, if we made more than 1 connection to the service, the circuit breaker would short-circuit the calls that didn’t match the config settings. What about if nodes go down in the cluster? Or some of our pods keep restarting? This is where the outlier detection comes into play. The outlier detection can detect when hosts are not reliable and can eject them for a period of time. So if we think about a Kubernetes service that load balances between 10 pods – if some of those pods are unhealthy and the outlier detection detects that, it will exclude or eject those pods from the load balancing pool. Let’s look at the settings: consecutiveErrors: number of errors before a host is ejected from the pool (default value is 5) – 502, 503 or 504 qualify as errors; for TCP, timeouts and connection errors qualify as errors Interval: period between host scanning; for example, the 10 second default value means that the upstream hosts will be scanned every 10 seconds and analyzed for failures baseEjectionTime: this value represents minimum ejection duration (default is 30 seconds) maxEjectionPercent: max % of hosts in the pool that can be ejected – default is 10% -- this means if more than 10% of the hosts are failing, only 10% will be ejected. 
minHealthPercent: with this value you can control whether outlier detection is enabled or not. For example: if you set the minHealthPercent to 50% and the number of healthy hosts drops below 50%, the outlier detection is disabled and in that case the proxy will load balance across all hosts in the pool, regardless of whether they are healthy or not --- ## Service Mesh - Delays ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: service-b spec: hosts: - service-b.default.svc.cluster.local http: - route: - destination: host: service-b subset: v1 * fault: * delay: * percentage: 50 * fixedDelay: 2s ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? For example, you can inject a fixed delay for a certain percentage of requests coming to the service. With this you can simulate latency and test how your services behave. In the example above, we are saying that we want 50% of incoming requests to be delayed by 2 seconds. --- ## Service Mesh - Aborts ```line-numbers apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: service-b spec: hosts: - service-b.default.svc.cluster.local http: - route: - destination: host: service-b subset: v1 * fault: * abort: * percentage: 30 * httpStatus: 404 ``` .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Similarly to the delays, you can also inject aborts into your services. This example is saying that for 30% of the requests, the service will return an HTTP status of 404. The best thing about all this: you can combine aborts with delays. And with the routing policies and with conditions. If you think about it, this can be an extremely powerful tool that helps you with testing your services and making them more resilient. 
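---
## Combining Faults with Conditions

Delays and aborts can be combined and scoped with match conditions so only test traffic is affected. A hedged sketch (the `x-chaos-test` header is an assumption made up for illustration; the `percentage`/`value` form follows the v1alpha3 fault injection API):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts:
    - service-b.default.svc.cluster.local
  http:
    - match:
        - headers:
            x-chaos-test:      # hypothetical header marking test traffic
              exact: "true"
      fault:
        delay:
          percentage:
            value: 50          # half of the matching requests wait 2s
          fixedDelay: 2s
        abort:
          percentage:
            value: 10          # a tenth of them fail outright
          httpStatus: 503
      route:
        - destination:
            host: service-b.default.svc.cluster.local
            subset: v1
    - route:                   # all other requests are untouched
        - destination:
            host: service-b.default.svc.cluster.local
            subset: v1
```

Regular users never see the faults; only requests carrying the header do.

.oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)]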
--- class: center, middle # Demo ### Service Resiliency .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- name: security class: middle, center # Security .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- background-image: url(./images/access-control-lock.jpeg) background-size: cover ??? When talking about security, we need to start with the access control systems. --- ## Access Control ## .center[Can a .highlight[principal] perform an .highlight[action] on an .highlight[object]?] .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? The question access control systems try to answer is: can some principal perform an action on an object?
--
.center[ ### .highlight[Principal] = user ### .highlight[Action] = delete ### .highlight[Object] = file ] .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? An example would be: can a user delete a file, or read or execute a script. Or if I put it in Istio/Kubernetes/cloud-native/microservices terms: can Service A perform an action on Service B? When talking about access control, we need both authentication AND authorization. --- ## Authentication (authn) - Verify credential is valid/authentic - Istio: X.509 certificates - Identity encoded in certificate .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Authentication is about the principal and is the act of taking a credential and making sure that it's valid and authentic. The credential in Istio's case is an X.509 certificate and it has an identity encoded in the certificate. Once authentication is performed, we are talking about an authenticated principal. --- ## Authorization (authz) - Is principal allowed to perform an action on an object? - Istio: RBAC policies - Role-based access control .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? What authorization tries to do is to answer the question of whether a principal is allowed to perform the action on the object. You can be authenticated, however you still might not be allowed to perform an action. Authorization of service-to-service communication in Istio is configured using RBAC (role-based access control) policies. --- class: center, middle ## .highlight[Authentication] and .highlight[authorization] work together .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? We need both authorization and authentication, otherwise having one without the other is useless. 
If you only authenticate the user without authorization, they can do whatever they want - perform any action on any object. Similarly, if you only authorize the requests, a user can pretend to be someone else and still perform the actions. --- ## Identity - SPIFFE - SPIFFE (Secure Production Identity Framework for Everyone) - Specially formed X.509 certificate with an ID (e.g. `spiffe://cluster.local/ns/default/sa/default`) - Kubernetes: service account is used .footnote[https://spiffe.io/] .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? The next thing we need to figure out is what identity means in Istio. Identity is quite important, as all functions inside the mesh rely on understanding how to identify the services. Istio uses SPIFFE, which provides a secure identity using a specially formed X.509 certificate. Istio implements this to issue the identities. It creates X.509 certificates that have the subject alternative name (SAN) set to a URI that describes the service. In the Kubernetes case, Istio will use the pod's service account as its identity and encode it into a URI. This means that if you haven't explicitly provided a service account for your pod to use, it will use the default service account from the namespace. --- ## Mutual TLS (mTLS) **Flow** 1. Traffic from client gets routed to the client side proxy 2. Client side proxy starts mTLS handshake - Secure naming check: verify service account in the cert can run the target service 3. Client and server side proxies establish mTLS connection 4. Server side proxy forwards traffic to the server service .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Let's talk about how these certificates are used in practice to identify the parties that communicate. The first thing that comes to mind when talking about TLS is HTTPS.
You open a browser to go to https://hello.com - the browser fires off an HTTPS request and waits for the response. When the browser tries to connect, the server presents a certificate with an identity (hello.com) that's signed by something the browser trusts. The browser then validates the certificate, authenticates its identity, and allows the connection to be established. Finally, a set of keys is generated for the connection to enable encryption. A real-world example here would be me presenting a passport to a customs officer - the passport has my identity and it's signed by an authority that the customs officer trusts. Now, mTLS is where both the client and the server present certificates. This allows the client to verify the server as well as the server to verify the client. If I go back to the customs officer example, it's me showing them my passport, but also them showing me their identification as well. Traffic between services gets routed through the Envoy proxies. The traffic leaving the service goes to the client sidecar proxy. The proxy then starts the mTLS handshake with the server side Envoy proxy. During this handshake, the proxy does a secure naming check to make sure that the service account in the server certificate is authorized to run the target service. The proxies establish the mTLS connection, and finally the server side proxy can forward the traffic to the service. --- ## Configuring mTLS/JWT - Policy resource (`authentication.istio.io/v1alpha1.Policy`) - Scope: - Mesh < namespace < service - Also supports JWT .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? The main resource used for configuring mTLS is called Policy. It allows you to require strict mTLS between services, make it optional - meaning you can use either plaintext or TLS - or disable it. All of this can be set at the whole service mesh level (i.e. everything within the mesh or cluster), per namespace, or for each service separately.
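To make this concrete, here is a minimal sketch of such a Policy, using the `authentication.istio.io/v1alpha1` API named on the slide (the namespace is illustrative). In this API, a namespace-wide policy is simply a Policy named `default` in that namespace:

```yaml
# Illustrative sketch: require strict mTLS for every workload in the
# "default" namespace. The scope follows from the metadata: a policy
# named "default" with no "targets" list applies namespace-wide;
# adding a "targets" list would narrow it to individual services.
apiVersion: authentication.istio.io/v1alpha1
kind: Policy
metadata:
  name: default
  namespace: default
spec:
  peers:
  - mtls:
      mode: STRICT
```

Swapping `STRICT` for `PERMISSIVE` accepts both plaintext and mTLS traffic, which corresponds to the "optional" setting described in the notes.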
When multiple policies are set, the namespace policy takes precedence over the mesh policy, and the per-service policy takes precedence over both. Having the ability to make mTLS optional makes it easier to gradually enable mTLS for the existing services you have. Istio also supports end-user authentication using JWT tokens, where the fields in the JWT are used to authenticate the user. --- ## Configuring authorization - Who can talk to whom - Uses RBAC (role-based access control) - Service role - Actions that can be performed on service by any principal with the role - Service role binding - Assigns roles to principals (principals = service identities = ServiceAccounts) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Once you have an authentication policy in place, you want to control which services can talk to each other. To do that you need to describe a service-to-service communication policy. This is done using RBAC - role-based access control. There are two objects you use to define a policy: 1. Service role: describes actions that can be performed on a set of services by any principal that has the role 2. Service role binding: assigns roles to a set of principals. --- ## Configuring RBAC - ClusterRbacConfig resource (`rbac.istio.io/v1alpha1`) - Multiple modes: - On, off - On with inclusion, on with exclusion .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] ??? Just like with mTLS, there's a ClusterRbacConfig resource that can be used to enable and configure RBAC. It supports multiple modes: - ON: requires RBAC policies for communication - OFF: RBAC is not required; this is the default - ON with inclusion: you define the set of namespaces/services in the policy where RBAC is required - ON with exclusion: you define the set of namespaces/services in the policy where RBAC is not required.
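The `rbac.istio.io/v1alpha1` resources just described can be sketched like this; the namespace, service, and service account names are hypothetical, and each resource would normally be applied as its own YAML document:

```yaml
# Illustrative sketch: enable RBAC only for the "default" namespace
# (the "ON with inclusion" mode described above).
apiVersion: rbac.istio.io/v1alpha1
kind: ClusterRbacConfig
metadata:
  name: default
spec:
  mode: ON_WITH_INCLUSION
  inclusion:
    namespaces: ["default"]
---
# A service role: any principal holding it may issue GET requests
# to the (hypothetical) customers service.
apiVersion: rbac.istio.io/v1alpha1
kind: ServiceRole
metadata:
  name: customers-viewer
  namespace: default
spec:
  rules:
  - services: ["customers.default.svc.cluster.local"]
    methods: ["GET"]
---
# A service role binding: grants the role to the principal identified
# by the (hypothetical) "web-frontend" service account.
apiVersion: rbac.istio.io/v1alpha1
kind: ServiceRoleBinding
metadata:
  name: bind-customers-viewer
  namespace: default
spec:
  subjects:
  - user: "cluster.local/ns/default/sa/web-frontend"
  roleRef:
    kind: ServiceRole
    name: customers-viewer
```

Note how the subject's `user` field is the SPIFFE-style identity derived from the Kubernetes service account, tying authorization back to the authenticated identity.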
So RBAC is on for everything except the namespaces declared in the policy. To do a gradual rollout of RBAC, you should use the ON with inclusion mode first. Then, as you define policies for your services, add them to the inclusion list. This allows you to enable RBAC for each service separately, without breaking anything. --- ## How to get started? - Do you need a service mesh? - Start small and slow: - Learn and understand the resources - Apply to a subset of services - Understand the metrics, logs, dashboards .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- ## Resources - Kubernetes (OKE) on Oracle Cloud - https://cloud.oracle.com - Kubernetes - https://kubernetes.io - Istio - https://istio.io .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- name: thankyou ## Thank you **Slides**: https://slides.peterj.dev/oracle-doag-2019 **Contact** - [@pjausovec](https://twitter.com/pjausovec) - [https://peterj.dev](https://peterj.dev) .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)] --- name:toc ### Table of Contents .small[ - [Service Mesh](#service-mesh-overview) - [Traffic Management](#traffic-management) - [Resiliency](#resiliency) - [Security](#security) ] .oracle-logo[ ] .twitter[![Twitter](./images/twitter.png) [@pjausovec](https://twitter.com/pjausovec)]