Getting Started

Welcome, and thank you for choosing Nobl9. This Getting Started Guide shows you how to start using the Nobl9 platform. Learn to access your account, connect to metrics, and work through your first service level objective (SLO).

Prerequisites

The following software, tools, or actions are needed to ensure a great onboarding experience:

  1. Use a Mac, Linux, or Windows machine to run sloctl. The Nobl9 command-line utility makes it easier to create and update many SLOs at once.

  2. Select a Kubernetes cluster or any Docker environment (a Docker environment on your local machine is fine to start with) to run the Nobl9 agent, which collects service level indicator metrics from your existing metrics system (such as Datadog, New Relic, or Prometheus).

  3. Verify that you received an email from Nobl9 to set up your user account.

Setting Up a User Account

As a Nobl9 user, you will receive an invitation email with an activation link.

💡 If you were invited to Nobl9 and did not receive an invitation email, contact support@nobl9.com.

  1. Locate the Nobl9 user invitation sent to your email.

  2. Click the link to accept the invitation and follow the instructions to set up your user account. After you set up your account, a confirmation page appears and asks you to return to the login screen.

  3. Return to the login screen by visiting https://app.nobl9.com in your browser.

Logging into Nobl9 User Interface

You will need to log into the Nobl9 web user interface (UI) using the credentials created during the account setup.

  1. Go to https://app.nobl9.com.

  2. Enter the email address and password you created during account setup, or click ‘Login with Google’ if you have a single sign-on (SSO) account.

Setting Up sloctl

The command-line interface (CLI) for Nobl9 is named sloctl. You can use the sloctl CLI when creating or updating multiple SLOs at once, creating or updating multiple thresholds, or when updating SLOs as part of CI/CD.

The web user interface gives you an easy way to create and update SLOs and to get familiar with the features available in Nobl9, while sloctl provides a systematic, automatable approach to maintaining SLOs as code.

For detailed instructions on installing, configuring, and using sloctl, refer to the Sloctl User Guide section of our documentation.

Defining a Data Source and Running a Nobl9 Agent

Running data collection through an agent means that special inbound access to your network is not needed and Nobl9 doesn’t have to store credentials to your other metric systems.

Use the following steps to define a data source and run a Nobl9 Agent.

  1. Go to the Integrations icon in the web UI.

  2. Select the Sources tab to define a data source.

  3. Follow the on-screen instructions to run the agent. Recommendations:

    1. Samples are provided for a Kubernetes Deployment and a simple Docker run command

    2. Run the agent(s) in production clusters or in a location that can access production metrics

    3. Consider running the agent in your local Docker environment at first for ease of troubleshooting
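
For reference, the data source and agent you define in the Sources tab are themselves Nobl9 resources that can also be expressed in YAML and applied with sloctl. The following is a minimal sketch assuming a Prometheus instance; the names and URL are illustrative placeholders, and the exact spec fields depend on your data source type (see the Sources documentation):

  apiVersion: n9/v1alpha
  kind: Agent
  metadata:
    name: my-prometheus-agent
    project: default
  spec:
    description: Collects SLI metrics from our Prometheus instance
    sourceOf:
      - Metrics
    prometheus:
      url: http://prometheus.monitoring.svc.cluster.local:9090

When you add the source in the UI, Nobl9 generates the matching Kubernetes Deployment snippet or Docker run command, including the credentials the agent needs to connect back to Nobl9.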

Using Resources in Nobl9

The following section is an overview of resources in the Nobl9 platform. Refer to the YAML guide to see how Nobl9 resource configurations are represented in the sloctl API and how you can express them in YAML format.

Projects

Projects are the primary logical grouping of resources in the Nobl9 platform. All Nobl9 resources, such as data sources, SLOs, and alerts, are created within a project. Access controls at the project level enable users to control who can see and change these resources. For example, you can allow all of your users to view the SLOs in a given project, but only a few users to make changes.

Before you can start creating SLOs, you have to create a project to put them in. Follow the instructions below to create a project:

  1. Go to Catalog > Projects.

  2. Click the + button.

  3. Enter a Display Name.
    The Name field will automatically be populated with a Kubernetes-style name, which you can modify if you like. We use this in our YAML configurations to ensure the uniqueness of object names.

  4. Add a Description.

  5. Click the ‘Create Project’ button.
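
A project can also be defined as code and applied with sloctl. A minimal sketch, with a placeholder name and description:

  apiVersion: n9/v1alpha
  kind: Project
  metadata:
    name: my-first-project
    displayName: My First Project
  spec:
    description: Groups the SLOs, services, and data sources for our storefront team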

Services

Nobl9 uses services to represent distinct boundaries in your application. A service can be a user journey, internal or external API, or some other boundary—anything you care about setting a service level objective for.

For example, in a service desk application, one service might be creating a new ticket. That service may rely on a user service, a queue, a notification service, and a database service, all of which could also be defined as services in Nobl9.

A service may be composed of other services. When adding a service, you can use labels to attach additional metadata such as team ownership or upstream/downstream dependencies. Services can be added manually via the user interface or YAML, or discovered automatically from a data source based on rules.

A service can have one or more SLOs defined for it. Every SLO created in Nobl9 must be tied to a service.
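
Like projects, services can be defined in YAML and applied with sloctl. A minimal sketch based on the service desk example above (all names are illustrative):

  apiVersion: n9/v1alpha
  kind: Service
  metadata:
    name: ticket-creation
    displayName: Ticket Creation
    project: my-first-project
  spec:
    description: Creating a new ticket in the service desk application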

Data Sources

Your service can have multiple data sources. Configure Nobl9 to connect to these (one or many) data sources to collect all service data in real time. Nobl9 can connect to data sources using two methods:

Use the Direct connection if you want Nobl9 to access your server by connecting directly over the internet. This method may be less secure as you will need to open the port the data source is running on for Nobl9 to connect.

Use the Agent method if you want to run an agent alongside your server. You will not need to expose your server directly to Nobl9; the agent periodically connects to Nobl9 using an outbound connection.
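
For illustration, a Direct connection is also defined as a Nobl9 resource, with the credentials stored by Nobl9. The sketch below assumes Datadog; the field names are assumptions for illustration only, and each data source type documents its own required credentials:

  apiVersion: n9/v1alpha
  kind: Direct
  metadata:
    name: datadog-direct
    project: default
  spec:
    sourceOf:
      - Metrics
    datadog:
      site: com
      apiKey: "<your Datadog API key>"
      applicationKey: "<your Datadog application key>"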

Service Level Objectives

With a service and its data sources configured, you can define thresholds for Service Level Indicators (SLIs). Together with a time window, they create a unique SLO.

SLOs allow you to define the reliability of your products and services in terms of customer expectations. You can create SLOs for user journeys, internal services, or even infrastructure. For more background on SLOs, go to our guide on Creating Your First SLO.

Complete the following steps to create an SLO in the Nobl9 UI:

Navigate to the Service Level Objectives icon and click the + icon to start the SLO Wizard, then follow the configuration steps in the wizard to create an SLO:

  1. Select a Service from the drop-down list to tag the service this SLO applies to.

  2. Go to Select Data Source and Metric step.

  3. Click the Data Source drop-down list to choose a data source.

  4. Select a type of Metric and enter a Query. (Refer to the examples below of what can be queried.)

    • A Threshold Metric is a single time series evaluated against a threshold.

    • A Ratio Metric allows you to enter two time series to compare (for example, a count of good requests and total requests); a YAML sketch of a ratio metric follows the query examples below.

  5. Choose a Rolling or Calendar-Aligned time window in the Define Time Window section.

    • Rolling time windows are better for tracking the recent user experience of a service.

    • Calendar-aligned windows are best suited for SLOs that are intended to map to business metrics measured on a calendar-aligned basis, e.g., every calendar month or every quarter.

  6. Define Error Budget Calculation and Objectives. Click the drop-down list in Error Budget Calculation Method and select either Occurrences or Time Slices. For more information, see the use case examples located in the last section of the Getting Started Guide.

  7. Name your objective in the Add Name, Alert Policy & Tags section.

  8. Click the drop-down list next to Alert Policies to send an alert.

    If no alerts were created, navigate to the Alert Policies page and click the + icon to start the Alert Policy Wizard.

  9. Create a Description.

Document relevant details or metadata for the SLI and SLO as descriptions. As a best practice, we recommend adding the team or owner details, or the purpose of creating this specific SLO. A description helps provide quick context about the SLO to any team member.

The following are some examples of what can be queried:

Context                                          Result
Web service or API                               HTTPS responses with 2xx and 3xx status codes.
A queue consumer                                 Successful processing of a message.
Serverless and function-based architectures      Successful completion of an invocation.
A batch job                                      Normal exit (for example, rc == 0) of the driving process or script.
A browser application                            Completion of a user action without JavaScript errors.
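
For a Ratio Metric, the SLO compares two queries instead of one: a count of good events against a count of total events. A hedged sketch of the shape such a definition takes, using illustrative Prometheus-style queries (where exactly this block sits in the SLO spec depends on the Nobl9 API version, so treat it as an illustration rather than a drop-in snippet):

  countMetrics:
    incremental: false
    good:
      prometheus:
        promql: sum(rate(http_requests_total{job="api", code=~"2..|3.."}[5m]))
    total:
      prometheus:
        promql: sum(rate(http_requests_total{job="api"}[5m]))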

Creating Alert Policies

Once you have created your SLO, you can configure an alert policy and alert method for it. An alert policy expresses a set of conditions you want to track or monitor. The conditions for an alert policy define what is monitored and when to activate an alert. When the performance of your service declines, Nobl9 sends a notification to a predefined channel.

Alerts in Nobl9 can be sent to several different tools, including PagerDuty, MS Teams, Slack, Discord, Jira, Opsgenie, and ServiceNow. Email alerts are also supported, and you can use webhooks to send alerts to any service that has a REST API, such as FireHydrant, Asana, xMatters, and many more.

For details on how to set up an alert, refer to the Alerting section of Nobl9 documentation.

Follow the steps below to set up an alert policy using the Alert Policy Wizard in the Web UI.

  1. Select the Alerts icon and click the + icon to enter the Alert Policy Wizard.

  2. Define an Alert Condition by selecting one or more of the boxes.

    A defined alert condition monitors the behavior and volatility of a data source. You can set a maximum of three alert conditions. Create another alert policy if you want to set more than three alert conditions.

    The Error budget relies on the targets set up in your SLAs and SLOs. Error budgets measure the maximum amount of time a system can fail without repercussions.

    The Remaining error budget is the amount leftover from the error budget set up in the SLO.

    The Error budget burn rate measures how fast the error budget is disappearing. The numbers in the error budget burn rate must match the numbers in the error budget.

  3. Go to the Add Alert Policy Name and Severity tab.

    a. Enter a Project.

    b. Enter a Display name (optional).

    c. Enter a Name for the alert (mandatory).

    d. Set the Severity to high, medium, or low.

    • High: A critical incident with a very high impact.

    • Medium: A major incident with a significant impact.

    • Low: A minor incident with low impact.

  4. Create an Alert Policy Description (optional).

  5. Go to Select Alert Method and select the box to set up a webhook.

Set up the integration in YAML and use sloctl to apply the changes. The webhook integration will then be available in the Alert Policy Wizard in the web UI.
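
As with other Nobl9 resources, an alert policy can also be written as YAML and applied with sloctl. A rough sketch of the shape; the condition measurement name and values here are illustrative assumptions, so check the Alerting documentation for the exact fields:

  apiVersion: n9/v1alpha
  kind: AlertPolicy
  metadata:
    name: fast-burn
    displayName: Fast error budget burn
    project: default
  spec:
    severity: High
    conditions:
      - measurement: averageBurnRate
        value: 2
        lastsFor: 15m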

Use Cases of SLO Configurations

The following examples explain how to create SLOs for sample services using sloctl.

A Typical Example of a Latency SLO for a RESTful Service

First, we want to pick an appropriate service level indicator to measure the latency of response from a RESTful service. In this example, let’s assume our service runs in the NGINX web server, and we’re going to use a threshold-based approach to define acceptable behavior. For example, we want the service to respond in a certain amount of time.

💡 Note: There are many ways to measure application performance. In this case we’re giving an example of server-side measurement at the application layer (NGINX). However, it might be advantageous for your application to measure this metric differently. For example, you might choose to measure performance at the client, or at the load balancer, or somewhere else. Your choice depends on what you are trying to measure or improve, as well as what data is currently available as usable metrics for the SLI.

The threshold approach uses a single query, and we set thresholds or breaking points on the results from that query to define the boundaries of acceptable behavior. In the SLO YAML, we specify the indicator like this:

  indicator:
    metricSource:
      name: my-prometheus-instance
      project: default
    rawMetric:
      prometheus:
        promql: server_requestMsec{job="nginx"}

In this example, we use Prometheus. The concepts are similar for other metrics stores. We recommend running the query against your Prometheus instance and reviewing the result data, so you can verify that the query returns what you expect and understand the units: whether it returns latencies as milliseconds or fractions of a second, for example. This query seems to return data between 60 and 150 milliseconds, with some occasional outliers.

Choosing a Time Window

We need to choose whether we want a rolling or calendar-aligned window.

For our RESTful service, we will be using the Rolling window SLO primarily to measure recent user experience. This will help us make decisions about the risk of changes, releases, and how best to invest our engineering resources on a week-to-week or sprint-to-sprint basis. We want the “recent” period that we’re measuring to trail back long enough that our users would consider its recent behavior.

We choose to go with a 28-day window, which has the advantage of containing an equal number of weekend days and weekdays as it rolls.

  timeWindows:
    - count: 28
      isRolling: true
      period:
        begin: "2020-12-01T00:00:00Z"
      unit: Day
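
Had we chosen a calendar-aligned window instead (for example, to match monthly business reporting), the time window block would look roughly like the sketch below. The calendar field names are an assumption for illustration, so verify them against the time window documentation before applying:

  timeWindows:
    - count: 1
      isRolling: false
      calendar:
        startTime: "2020-12-01 00:00:00"
        timeZone: "UTC"
      unit: Month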

Choosing a Budgeting Method

There are two budgeting methods to choose from: Time Slices and Occurrences.

Time Slices

In the Time Slices method, the objective we measure is how many good minutes were achieved (minutes in which our system operates within defined boundaries) compared to the total minutes in the window.

This is useful for some scenarios, but it has a disadvantage when we’re looking at “recent user experience” as we are with this SLO. The disadvantage is that a bad minute that occurs during a low-traffic period (say, in the middle of the night for most of our users, when they are unlikely to even notice a performance issue) would penalize the SLO the same amount as a bad minute during peak traffic times.

Occurrences

The Occurrences method is well suited to this situation. Occurrences count good attempts (in this example, requests that are within defined boundaries) against the count of all attempts (this means all requests, including requests that perform outside of defined boundaries). Since total attempts are fewer during low-traffic periods, it automatically adjusts to lower traffic volume.

    budgetingMethod: Occurrences

Establishing Thresholds

In this example we’ve talked to our product and support teams and can establish the following thresholds:

    - budgetTarget: 0.95
      displayName: Laggy
      value: 100
      op: lte

This threshold requires that 95% of requests are completed within 100ms.

You can name each threshold however you want. We recommend naming them how a user of the service (or how another service that uses this service) might describe the experience at a given threshold. Typically, we use names that are descriptive adjectives of the experience when the threshold is not met. When the threshold is violated, we can say that the user’s experience is “Laggy”.

Let’s define another threshold. In the above threshold, we allow 5% of requests to run longer than 100ms. We want most of that 5% (say, 80% of the remaining 5% of queries) to still return within a quarter of a second (250ms). That means 99% of queries return within 250ms (95% + 4%). Add a threshold like this:

    - budgetTarget: 0.99
      displayName: Slow
      value: 250
      op: lte

This threshold requires that 99% of requests are completed within 250ms.

Finally, we add a third threshold for the worst experience we are willing to tolerate:

    - budgetTarget: 0.999
      displayName: Painful
      value: 500
      op: lte

This threshold requires that 99.9% of requests are completed within 500ms.

In sum, a complete SLO definition looks like this (this example uses a New Relic metric):

- apiVersion: n9/v1alpha
  kind: SLO
  metadata:
    displayName: adminpageload
    name: adminpageload
    project: external
  spec:
    alertPolicies: []
    budgetingMethod: Occurrences
    description: ""
    service: venderportal
    indicator:
      metricSource:
        name: cooperlab
        project: default
      rawMetric:
        newRelic:
          nrql: SELECT average(duration) FROM SyntheticRequest WHERE monitorId = '339adbc4-01e4-4517-88cf-ece25cb66156'
    objectives:
      - displayName: ok
        op: lt
        tag: external.adminpageload.70d000000
        target: 0.98
        value: 70
      - displayName: laggy
        op: lt
        tag: external.adminpageload.85d000000
        target: 0.99
        value: 85
      - displayName: poor
        op: lt
        tag: external.adminpageload.125d000000
        value: 125
    timeWindows:
      - count: 1
        isRolling: true
        period:
          begin: "2021-03-08T06:46:08Z"
          end: "2021-03-08T07:46:08Z"
        unit: Hour