3 models for logging with OpenTelemetry and Elastic
https://www.elastic.co/observability-labs/blog/3-models-logging-opentelemetry
Tue, 27 Jun 2023

Arguably, OpenTelemetry exists to (greatly) increase usage of tracing and metrics among developers. That said, logging will continue to play a critical role in providing flexible, application-specific, event-driven data. Further, OpenTelemetry has the potential to bring added value to existing application logging flows:

  1. Common metadata across tracing, metrics, and logging to facilitate contextual correlation, including metadata passed between services as part of REST or RPC APIs; this is a critical element of service observability in the age of distributed, horizontally scaled systems

  2. An optional unified data path for tracing, metrics, and logging to facilitate common tooling and signal routing to your observability backend

Adoption of metrics and tracing among developers to date has been relatively small. Further, the number of proprietary vendors and APIs (compared to adoption rate) is relatively large. As such, OpenTelemetry took a greenfield approach to developing new, vendor-agnostic APIs for tracing and metrics. In contrast, most developers have nearly 100% log coverage across their services. Moreover, logging is largely supported by a small number of vendor-agnostic, open-source logging libraries and associated APIs (e.g., Logback and ILogger). As such, OpenTelemetry’s approach to logging meets developers where they already are using hooks into existing, popular logging frameworks. In this way, developers can add OpenTelemetry as a log signal output without otherwise altering their code and investment in logging as an observability signal.

Notably, logging is the least mature of OTel supported observability signals. Depending on your service’s language, and your appetite for adventure, there exist several options for exporting logs from your services and applications and marrying them together in your observability backend.

The intent of this article is to explore the current state of the art of OpenTelemetry logging and to provide guidance on the available approaches with the following tenets in mind:

  • Correlation of service logs with OTel-generated tracing where applicable
  • Proper capture of exceptions
  • Common context across tracing, metrics, and logging
  • Support for slf4j key-value pairs (“structured logging”)
  • Automatic attachment of metadata carried between services via OTel baggage
  • Use of an Elastic® Observability backend
  • Consistent data fidelity in Elastic regardless of the approach taken

OpenTelemetry logging models

Three models currently exist for getting your application or service logs to Elastic with correlation to OTel tracing and baggage:

  1. Output logs from your service (alongside traces and metrics) using an embedded OpenTelemetry Instrumentation library to Elastic via the OTLP protocol

  2. Write logs from your service to a file scraped by the OpenTelemetry Collector, which then forwards to Elastic via the OTLP protocol

  3. Write logs from your service to a file scraped by Elastic Agent (or Filebeat), which then forwards to Elastic via an Elastic-defined protocol

Note that (1), in contrast to (2) and (3), does not involve writing service logs to a file prior to ingestion into Elastic.

Logging vs. span events

It is worth noting that most APM systems, including OpenTelemetry, include provisions for span events. Like log statements, span events contain arbitrary, textual data. Additionally, span events automatically carry any custom attributes (e.g., a “user ID”) applied to the parent span, which can help with correlation and context. In this regard, it may be advantageous to translate some existing log statements (inside spans) to span events. As the name implies, of course, span events can only be emitted from within a span and thus are not intended to be a general purpose replacement for logging.

Unlike logging, span events do not pass through existing logging frameworks and therefore cannot (practically) be written to a log file. Further, span events are technically emitted as part of trace data and follow the same data path and signal routing as other trace data.
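For illustration (this is not part of the demo project, and the span and attribute names are made up), emitting a span event with attributes through the OpenTelemetry Python API looks roughly like this:

from opentelemetry import trace

tracer = trace.get_tracer("checkout")

with tracer.start_as_current_span("charge-card") as span:
    # The event carries its own attributes and inherits the span's context,
    # so it is automatically correlated with the surrounding trace.
    span.add_event("payment.retry", attributes={"user.id": "42", "attempt": 2})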

Polyfill appender

Some of the demos make use of a custom Logback “Polyfill appender” (inspired by OTel’s Logback MDC), which provides support for attaching slf4j key-value pairs to log messages for models (2) and (3).

Elastic Common Schema

For log messages to exhibit full fidelity within Elastic, they eventually need to be formatted in accordance with the Elastic Common Schema (ECS). In models (1) and (2), log messages remain formatted in OTel log semantics until ingested by the Elastic APM Server. The Elastic APM Server then translates OTel log semantics to ECS. In model (3), ECS is applied at the source.

Notably, OpenTelemetry recently adopted the Elastic Common Schema as its standard for semantic conventions going forward! As such, it is anticipated that current OTel log semantics will be updated to align with ECS.

Getting started

The included demos center around a “POJO” (no assumed framework) Java project. Java is arguably the most mature of OTel-supported languages, particularly with respect to logging options. Notably, this singular Java project was designed to support the three models of logging discussed here. In practice, you would only implement one of these models (and corresponding project dependencies).

The demos assume you have a working Docker environment and an Elastic Cloud instance.

  1. git clone https://github.com/ty-elastic/otel-logging

  2. Create an .env file at the root of otel-logging with the following (appropriately filled-in) environment variables:

# the service name
OTEL_SERVICE_NAME=app4

# Filebeat vars
ELASTIC_CLOUD_ID=(see https://www.elastic.co/guide/en/beats/metricbeat/current/configure-cloud-id.html)
ELASTIC_CLOUD_AUTH=(see https://www.elastic.co/guide/en/beats/metricbeat/current/configure-cloud-id.html)

# apm vars
ELASTIC_APM_SERVER_ENDPOINT=(address of your Elastic Cloud APM server... i.e., https://xyz123.apm.us-central1.gcp.cloud.es.io:443)
ELASTIC_APM_SERVER_SECRET=(see https://www.elastic.co/guide/en/apm/guide/current/secret-token.html)
  3. Start up the demo with the desired model:
  • If you want to demo logging via OTel APM Agent, run MODE=apm docker-compose up
  • If you want to demo logging via OTel filelogreceiver, run MODE=filelogreceiver docker-compose up
  • If you want to demo logging via Elastic filebeat, run MODE=filebeat docker-compose up
  4. Validate incoming span and correlated log data in your Elastic Cloud instance

Model 1: Logging via OpenTelemetry instrumentation

This model aligns with the long-term goals of OpenTelemetry: integrated tracing, metrics, and logging (with common attributes) from your services via the OpenTelemetry Instrumentation libraries, without dependency on log files and scrapers.

In this model, your service generates log statements as it always has, using popular logging libraries (e.g., Logback for Java). OTel provides a “Southbound hook” to Logback via the OTel Logback Appender, which injects ServiceName, SpanID, TraceID, slf4j key-value pairs, and OTel baggage into log records and passes the composed records to the co-resident OpenTelemetry Instrumentation library. We further employ a custom LogRecordProcessor to add baggage to the log record as attributes.

The OTel instrumentation library then formats the log statements per the OTel logging spec and ships them via OTLP to either an OTel Collector for further routing and enrichment or directly to Elastic.

Notably, as language support improves, this model can and will be supported by runtime agent binding with auto-instrumentation where available (e.g., no code changes required for runtime languages).

One distinguishing advantage of this model, beyond the simplicity it affords, is the ability to more easily tie together attributes and tracing metadata directly with log statements. This inherently makes logging more useful in the context of other OTel-supported observability signals.
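The demo project here is Java, but as a rough sketch of the same model in Python: the standard logging module is bridged into the OTel SDK and exported over OTLP. The logs SDK is still experimental, so the module paths below may differ between releases, and the endpoint and token are placeholders.

import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.resources import Resource

# Resource attributes (service.name, etc.) are attached to every log record
provider = LoggerProvider(resource=Resource.create({"service.name": "app4"}))
set_logger_provider(provider)

# Ship records over OTLP; endpoint and token are placeholders for your APM server
exporter = OTLPLogExporter(
    endpoint="https://my-apm-server:443",
    headers={"authorization": "Bearer <secret-token>"},
)
provider.add_log_record_processor(BatchLogRecordProcessor(exporter))

# Bridge stdlib logging into the OTel pipeline, so existing log calls are reused as-is
logging.getLogger().addHandler(LoggingHandler(level=logging.INFO, logger_provider=provider))

logging.getLogger(__name__).info("order placed")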

Architecture

model 1 architecture

Although not explicitly pictured, an OpenTelemetry Collector can be inserted in between the service and Elastic to facilitate additional enrichment and/or signal routing or duplication across observability backends.

Pros

  • Simplified signal architecture and fewer “moving parts” (no files, disk utilization, or file rotation concerns)
  • Aligns with long-term OTel vision
  • Log statements can be (easily) decorated with OTel metadata
  • No polyfill adapter required to support structured logging with slf4j
  • No additional collectors/agents required
  • Conversion to ECS happens within Elastic keeping log data vendor-agnostic until ingestion
  • Common wireline protocol (OTLP) across tracing, metrics, and logs

Cons

  • Not available (yet) in many OTel-supported languages
  • No intermediate log file for ad-hoc, on-node debugging
  • Immature (alpha/experimental)
  • Unknown “glare” conditions, which could result in loss of log data if the service exits prematurely or if the backend is unable to accept log data for an extended period of time

Demo

MODE=apm docker-compose up

Model 2: Logging via the OpenTelemetry Collector

Given the cons of Model 1, it may be advantageous to consider a model that continues to leverage an actual log file intermediary between your services and your observability backend. Such a model is possible using an OpenTelemetry Collector collocated with your services (e.g., on the same host), running the filelogreceiver to scrape service log files.

In this model, your service generates log statements as it always has, using popular logging libraries (e.g., Logback for Java). OTel provides a MDC Appender for Logback (Logback MDC), which adds SpanID, TraceID, and Baggage to the Logback MDC context.

Notably, no log record structure is assumed by the OTel filelogreceiver. In the example provided, we employ the logstash-logback-encoder to JSON-encode log messages. The logstash-logback-encoder will read the OTel SpanID, TraceID, and Baggage off the MDC context and encode it into the JSON structure. Notably, logstash-logback-encoder doesn’t explicitly support slf4j key-value pairs. It does, however, support Logback structured arguments, and thus I use the Polyfill Appender to convert slf4j key-value pairs to Logback structured arguments.

From there, we write the log lines to a log file. If you are using Kubernetes or other container orchestration in your environment, you would more typically write to stdout (console) and let the orchestration log driver write to and manage log files.

We then configure the OTel Collector to scrape this log file (using the filelogreceiver). Because no assumptions are made about the format of the log lines, you need to explicitly map fields from your log schema to the OTel log schema.

From there, the OTel Collector batches and ships the formatted log lines via OTLP to Elastic.

Architecture

model 2 architecture

Pros

  • Easy to debug (you can manually read the intermediate log file)
  • Inherent file-based FIFO buffer
  • Less susceptible to “glare” conditions when service prematurely exits
  • Conversion to ECS happens within Elastic keeping log data vendor-agnostic until ingestion
  • Common wireline protocol (OTLP) across tracing, metrics, and logs

Cons

  • All the headaches of file-based logging (rotation, disk overflow)
  • Beta quality and not yet proven in the field
  • No support for slf4j key-value pairs

Demo

MODE=filelogreceiver docker-compose up

Model 3: Logging via Elastic Agent (or Filebeat)

Although the second model described affords some resilience as a function of the backing file, the OTel Collector filelogreceiver module is still decidedly “beta” in quality. Because of the importance of logs as a debugging tool, today I generally recommend that customers continue to import logs into Elastic using the field-proven Elastic Agent or Filebeat scrapers. Elastic Agent and Filebeat have many years of field maturity under their collective belt. Further, it is often advantageous to deploy Elastic Agent anyway to capture the multitude of signals outside the purview of OpenTelemetry (e.g., deep Kubernetes and host metrics, security, etc.).

In this model, your service generates log statements as it always has, using popular logging libraries (e.g., Logback for Java). As with model 2, we employ OTel’s Logback MDC to add SpanID, TraceID, and Baggage to the Logback MDC context.

From there, we employ the Elastic ECS Encoder to encode log statements compliant with the Elastic Common Schema. The Elastic ECS Encoder will read the OTel SpanID, TraceID, and Baggage off the MDC context and encode them into the JSON structure. Similar to model 2, the Elastic ECS Encoder doesn’t support slf4j key-value pairs. Curiously, the Elastic ECS Encoder also doesn’t appear to support Logback structured arguments. Thus, within the Polyfill Appender, I add slf4j key-value pairs as MDC context. This is less than ideal, however, since MDC forces all values to be strings.

From there, we write the log lines to a log file. If you are using Kubernetes or other container orchestration in your environment, you would more typically write to stdout (console) and let the orchestration log driver write to and manage log files. We then configure Elastic Agent or Filebeat to scrape the log file. Notably, the Elastic ECS Encoder does not currently translate incoming OTel SpanID and TraceID variables on the MDC. Thus, we need to perform manual translation of these variables in the Filebeat (or Elastic Agent) configuration to map them to their ECS equivalents.

Architecture

model 3 architecture

Pros

  • Robust and field-proven
  • Easy to debug (you can manually read the intermediate log file)
  • Inherent file-based FIFO buffer
  • Less susceptible to “glare” conditions when service prematurely exits
  • Native ECS format for easy manipulation in Elastic
  • Fleet-managed via Elastic Agent

Cons

  • All the headaches of file-based logging (rotation, disk overflow)
  • No support for slf4j key-value pairs or Logback structured arguments
  • Requires translation of OTel SpanID and TraceID in Filebeat config
  • Disparate data paths for logs versus tracing and metrics
  • Vendor-specific logging format

Demo

MODE=filebeat docker-compose up

Recommendations

For most customers, I currently recommend Model 3 — namely, write to logs in ECS format (with OTel SpanID, TraceID, and Baggage metadata) and collect them with an Elastic Agent installed on the node hosting the application or service. Elastic Agent (or Filebeat) today provides the most field-proven and robust means of capturing log files from applications and services with OpenTelemetry context.

Further, you can leverage this same Elastic Agent instance (ideally running in your Kubernetes daemonset) to collect rich and robust metrics and logs from Kubernetes and many other supported services via Elastic Integrations. Finally, Elastic Agent facilitates remote management via Fleet, avoiding bespoke configuration files.

Alternatively, for customers who either wish to keep their nodes vendor-neutral or use a consolidated signal routing system, I recommend Model 2, wherein an OpenTelemetry collector is used to scrape service log files. While workable and practiced by some early adopters in the field today, this model inherently carries some risk given the current beta nature of the OpenTelemetry filelogreceiver.

I generally do not recommend Model 1 given its limited language support, experimental/alpha status (the API could change), and current potential for data loss. That said, in time, with more language support and more thought to resilient designs, it has clear advantages both with regard to simplicity and richness of metadata.

Extracting more value from your logs

In contrast to tracing and metrics, most organizations have nearly 100% log coverage over their applications and services. This is an ideal beachhead upon which to build an application observability system. On the other hand, logs are notoriously noisy and unstructured; this is only amplified with the scale enabled by the hyperscalers and Kubernetes. Collecting log lines reliably is the easy part; making them useful at today’s scale is hard.

Given that logs are arguably the most challenging observability signal from which to extract value at scale, one should ideally give thoughtful consideration to a vendor’s support for logging in the context of other observability signals. Can they handle surges in log rates because of unexpected scale or an error or test scenario? Do they have the machine learning tool set to automatically recognize patterns in log lines, sort them into categories, and identify true anomalies? Can they provide cost-effective online searchability of logs over months or years without manual rehydration? Do they provide the tools to extract and analyze business KPIs buried in logs?

As an ardent and early supporter of OpenTelemetry, Elastic, of course, natively ingests OTel traces, metrics, and logs. And just like all logs coming into our system, logs coming from OTel-equipped sources avail themselves of our mature tooling and next-gen AI Ops technologies to enable you to extract their full value. Interested? Reach out to our pre-sales team to get started building with Elastic!

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

The antidote for index mapping exceptions: ignore_malformed
https://www.elastic.co/observability-labs/blog/antidote-index-mapping-exceptions-ignore-malformed
Thu, 03 Aug 2023

In this article, I'll explain how the setting ignore_malformed can make the difference between a 100% dropping rate and a 100% success rate, even with ignoring some malformed fields.

As a senior software engineer working at Elastic®, I have been on the first line of support for anything related to Beats or Elastic Agent running on Kubernetes and Cloud Native integrations like Nginx ingress controller.

During my experience, I have seen all sorts of issues. Users have very different requirements. But at some point during their experience, most of them encounter a very common problem with Elasticsearch: index mapping exceptions.

How mappings work

Like any other document-based NoSQL database, Elasticsearch doesn’t force you to provide the document schema (called index mapping or simply mapping) upfront. If you provide a mapping, it will use it. Otherwise, it will infer one from the first document or any subsequent documents that contain new fields.

In reality, the situation is not black and white. You can also provide a partial mapping that covers only some of the fields, like the most common fields, and leave Elasticsearch to figure out the mapping of all the other fields during ingestion with Dynamic Mapping.

What happens when data is malformed?

No matter whether you specified a mapping upfront or Elasticsearch inferred one automatically, Elasticsearch will drop an entire document and return an error instead if even a single field doesn't match the index mapping. This is not much different from what happens with other SQL databases or NoSQL data stores with inferred schemas. The reason for this behavior is to prevent malformed data and exceptions at query time.

A problem arises if a user doesn't look at the ingestion logs and misses those errors. They might never figure out that something went wrong, or even worse, Elasticsearch might stop ingesting data entirely if all the subsequent documents are malformed.

The above situation sounds very catastrophic, but it's entirely possible since I have seen it many times when on-call for support or on discuss.elastic.co. The situation is even more likely to happen if you have user-generated documents, so you don't have full control over the quality of your data.

Luckily, there is a little-known setting in Elasticsearch that solves exactly the problems above. It has been available since Elasticsearch 2.0, ancient history considering that the latest version of the stack at the time of writing is Elastic Stack 8.9.0.

Let's now dive into how to use this Elasticsearch feature.

A toy use case

To make it easier to interact with Elasticsearch, I am going to use Kibana® Dev Tools in this tutorial.

The following examples are taken from the official documentation on ignore_malformed. I am here to expand on those examples by providing a few more details about what happens behind the scenes and on how to search for ignored fields. We are going to use the index name my-index, but feel free to change that to whatever you like.

First, we want to create an index mapping with two fields called number_one and number_two. Both fields have type integer, but only one of them has ignore_malformed set to true, and the other one inherits the default value ignore_malformed: false instead.

PUT my-index
{
  "mappings": {
    "properties": {
      "number_one": {
        "type": "integer",
        "ignore_malformed": true
      },
      "number_two": {
        "type": "integer"
      }
    }
  }
}

If the mentioned index didn’t exist before and the previous command ran successfully, you should get the following result:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "my-index"
}

To double-check that the above mapping has been created correctly, we can query the newly created index with the command:

GET my-index/_mapping

You should get the following result:

{
  "my-index": {
    "mappings": {
      "properties": {
        "number_one": {
          "type": "integer",
          "ignore_malformed": true
        },
        "number_two": {
          "type": "integer"
        }
      }
    }
  }
}

Now we can ingest two sample documents — both invalid:

PUT my-index/_doc/1
{
  "text":       "Some text value",
  "number_one": "foo"
}

PUT my-index/_doc/2
{
  "text":       "Some text value",
  "number_two": "foo"
}

The document with id=1 is correctly ingested, while the document with id=2 fails with the following error. The difference between those two documents is in which field we are trying to ingest a sample string “foo” instead of an integer.

{
  "error": {
    "root_cause": [
      {
        "type": "document_parsing_exception",
        "reason": "[3:17] failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'"
      }
    ],
    "type": "document_parsing_exception",
    "reason": "[3:17] failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'",
    "caused_by": {
      "type": "number_format_exception",
      "reason": "For input string: \"foo\""
    }
  },
  "status": 400
}

Depending on the client used for ingesting your documents, you might get different errors or warnings, but logically the problem is the same. The entire document is not ingested because part of it doesn’t conform with the index mapping. There are too many possible error messages to name, but suffice it to say that malformed data is quite a common problem. And we need a better way to handle it.
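For instance, with the Python client (a hedged sketch; the connection details are placeholders), the rejected document surfaces as an API error that your ingestion code has to catch and handle:

from elasticsearch import ApiError, Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")  # placeholder connection details

try:
    es.index(index="my-index", id="2", document={"text": "Some text value", "number_two": "foo"})
except ApiError as err:
    # document_parsing_exception: the entire document was dropped, not just the bad field
    print("ingestion failed:", err)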

Now that at least one document has been ingested, you can try searching with the following query:

GET my-index/_search
{
  "fields": [
    "*"
  ]
}

Here, the parameter fields is required to show the values of those fields that have been ignored. More on this later.

From the result, you can see that only the first document (with id=1) has been ingested correctly while the second document (with id=2) has been completely dropped.

{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "my-index",
        "_id": "1",
        "_score": null,
        "_ignored": ["number_one"],
        "_source": {
          "text": "Some text value",
          "number_one": "foo"
        },
        "fields": {
          "text": ["Some text value"],
          "text.keyword": ["Some text value"]
        },
        "ignored_field_values": {
          "number_one": ["foo"]
        },
        "sort": ["1"]
      }
    ]
  }
}

From the above JSON response, you will notice some things, such as:

  • A new field called _ignored of type array with the list of all fields that have been ignored while ingesting the document
  • A new field called ignored_field_values with a dictionary of ignored fields and their values
  • The field called _source contains the original document unmodified. This is especially useful if you want to fix the problems with the mapping later.
  • The field called text was not present in the original mapping, but it is now included since Elasticsearch automatically inferred its type. In fact, if you try to query the mapping of the index my-index again via the command:
GET my-index/_mapping

You should get this result:

{
  "my-index": {
    "mappings": {
      "properties": {
        "number_one": {
          "type": "integer",
          "ignore_malformed": true
        },
        "number_two": {
          "type": "integer"
        },
        "text": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

Finally, if you ingest some valid documents like the following command:

PUT my-index/_doc/3
{
  "text":       "Some text value",
  "number_two": 10
}

You can check how many documents have at least one ignored field with the following Exists query:

GET my-index/_search
{
  "query": {
    "exists": {
      "field": "_ignored"
    }
  }
}

You can also see that out of the two documents ingested (with id=1 and id=3) only the document with id=1 contains an ignored field.

{
  "took": 193,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-index",
        "_id": "1",
        "_score": 1,
        "_ignored": ["number_one"],
        "_source": {
          "text": "Some text value",
          "number_one": "foo"
        }
      }
    ]
  }
}

Alternatively, you can search for all documents that have a specific field being ignored with this Terms query:

GET my-index/_search
{
  "query": {
    "terms": {
      "_ignored": [ "number_one"]
    }
  }
}

The result, in this case, will be the same as the previous one since we only managed to ingest a single document with that exact single field ignored.

Conclusion

Because we are big fans of this flag, we've enabled ignore_malformed by default for all Elastic integrations and in the default index template for logs data streams as of 8.9.0. More information can be found in the official documentation for ignore_malformed.

And since I am personally working on this feature, I can reassure you that it is a game changer.

You can start by setting ignore_malformed manually on any cluster before Elastic Stack 8.9.0, or you can rely on the defaults that we set for you starting from Elastic Stack 8.9.0.
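As a hedged sketch with the Python client (the template and index names are placeholders), you can enable it for a whole index pattern through the index-level setting in an index template:

from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")  # placeholder connection details

# Apply ignore_malformed to all fields of matching indices via the index-level setting
es.indices.put_index_template(
    name="my-logs-template",
    index_patterns=["my-logs-*"],
    template={"settings": {"index.mapping.ignore_malformed": True}},
)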

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

Unleash the power of Elastic and Amazon Kinesis Data Firehose to enhance observability and data analytics
https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics
Thu, 18 May 2023

As more organizations leverage the Amazon Web Services (AWS) cloud platform and services to drive operational efficiency and bring products to market, managing logs becomes a critical component of maintaining visibility and safeguarding multi-account AWS environments. Traditionally, logs are stored in Amazon Simple Storage Service (Amazon S3) and then shipped to an external monitoring and analysis solution for further processing.

To simplify this process and reduce management overhead, AWS users can now leverage the new Amazon Kinesis Firehose Delivery Stream to ingest logs into Elastic Cloud in AWS in real time and view them in the Elastic Stack alongside other logs for centralized analytics. This eliminates the necessity for time-consuming and expensive procedures such as VM provisioning or data shipper operations.

Elastic Observability unifies logs, metrics, and application performance monitoring (APM) traces for a full contextual view across your hybrid AWS environments alongside their on-premises data sets. Elastic Observability enables you to track and monitor performance across a broad range of AWS services, including AWS Lambda, Amazon Elastic Compute Cloud (EC2), Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), Amazon Simple Storage Service (S3), Amazon Cloudtrail, Amazon Network Firewall, and more.

In this blog, we will walk you through how to use the Amazon Kinesis Data Firehose integration — Elastic is listed in the Amazon Kinesis Firehose drop-down list — to simplify your architecture and send logs to Elastic, so you can monitor and safeguard your multi-account AWS environments.

Announcing the Kinesis Firehose method

Elastic currently provides both agent-based and serverless mechanisms, and we are pleased to announce the addition of the Kinesis Firehose method. This new method enables customers to directly ingest logs from AWS into Elastic, supplementing our existing options.

  • Elastic Agent pulls metrics and logs from CloudWatch and S3 where logs are generally pushed from a service (for example, EC2, ELB, WAF, Route53) and ingests them into Elastic Cloud.
  • Elastic’s Serverless Forwarder (which runs on Lambda and is available in AWS SAR) sends logs from Kinesis Data Stream, Amazon S3, and AWS CloudWatch log groups into Elastic. To learn more about this topic, please see this blog post.
  • Amazon Kinesis Firehose directly ingests logs from AWS into Elastic (specifically, if you are running the Elastic Cloud on AWS).

In this blog, we will cover the last option since we have recently released the Amazon Kinesis Data Firehose integration. Specifically, we'll review:

  • A general overview of the Amazon Kinesis Data Firehose integration and how it works with AWS
  • Step-by-step instructions to set up the Amazon Kinesis Data Firehose integration on AWS and on Elastic Cloud

By the end of this blog, you'll be equipped with the knowledge and tools to simplify your AWS log management with Elastic Observability and Amazon Kinesis Data Firehose.

Prerequisites and configurations

If you intend to follow the steps outlined in this blog post, there are a few prerequisites and configurations that you should have in place beforehand.

  1. You will need an account on Elastic Cloud and a deployed stack on AWS. Instructions for deploying a stack on AWS can be found here. This is necessary for AWS Firehose Log ingestion.
  2. You will also need an AWS account with the necessary permissions to pull data from AWS. Details on the required permissions can be found in our documentation.
  3. Finally, be sure to turn on VPC Flow Logs for the VPC where your application is deployed and send them to AWS Firehose.

Elastic’s Amazon Kinesis Data Firehose integration

Elastic has collaborated with AWS to offer a seamless integration of Amazon Kinesis Data Firehose with Elastic, enabling direct ingestion of data from Amazon Kinesis Data Firehose into Elastic without the need for Agents or Beats. All you need to do is configure the Amazon Kinesis Data Firehose delivery stream to send its data to Elastic's endpoint. In this configuration, we will demonstrate how to ingest VPC Flow logs and Firewall logs into Elastic. You can follow a similar process to ingest other logs from your AWS environment into Elastic.

There are three distinct configurations for ingesting VPC Flow and Network Firewall logs into Elastic: one sends logs through CloudWatch, another goes through S3, and the third sends them directly to Kinesis Data Firehose; each has its own setup. With CloudWatch and S3 you can store and forward the logs, whereas with Kinesis Firehose you ingest them immediately. In this blog post, we will focus on the new configuration that sends VPC Flow logs and Network Firewall logs directly to Elastic.

AWS elastic configuration

We will guide you through the configuration of the easiest setup, which involves directly sending VPC Flow logs and Firewalls logs to Amazon Kinesis Data Firehose and then into Elastic Cloud.

Note: This setup is only compatible with Elastic Cloud on AWS; it cannot be used with self-managed, on-premises, or other cloud providers' Elastic deployments.

Setting it all up

To begin setting up the integration between Amazon Kinesis Data Firehose and Elastic, let's go through the necessary steps.

Step 0: Get an account on Elastic Cloud

Create an account on Elastic Cloud by following the instructions provided to get started on Elastic Cloud.

elastic free trial

Step 1: Deploy Elastic on AWS

You can deploy Elastic on AWS via two different approaches: through the UI or through Terraform. We’ll start first with the UI option.

After logging into Elastic Cloud, create a deployment on Elastic. It's crucial to make sure that the deployment is on Elastic Cloud on AWS since the Amazon Kinesis Data Firehose connects to a specific endpoint that must be on AWS.

create a deployment

After your deployment is created, it's essential to copy the Elasticsearch endpoint to ensure a seamless configuration process.

O11y log

The Elasticsearch HTTP endpoint should be copied and used for Amazon Firehose destination configuration purposes, as it will be required. Here's an example of what the endpoint should look like:

https://elastic-O11y-log.es.us-east-1.aws.found.io

Alternative approach using Terraform

An alternative approach to deploying Elastic Cloud on AWS is by using Terraform. It's also an effective way to automate and streamline the deployment process.

To begin, simply create a Terraform configuration file that outlines the necessary infrastructure. This file should include resources for your Elastic Cloud deployment and any required IAM roles and policies. By using this approach, you can simplify the deployment process and ensure consistency across environments.

One easy way to create your Elastic Cloud deployment with Terraform is to use this Github repo. This resource lets you specify the region, version, and deployment template for your Elastic Cloud deployment, as well as any additional settings you require.

Step 2: To turn on Elastic's AWS integrations, navigate to the Elastic Integration section in your deployment

To install AWS assets in your deployment's Elastic Integration section, follow these steps:

  1. Log in to your Elastic Cloud deployment and open Kibana.
  2. To get started, go to the management section of Kibana and click on "Integrations."
  3. Navigate to the AWS integration and click on the "Install AWS Assets" button in the settings. This step is important as it installs the necessary assets such as dashboards and ingest pipelines to enable data ingestion from AWS services into Elastic.

aws settings

Step 3: Set up the Amazon Kinesis Data Firehose delivery stream on the AWS Console

You can set up the Kinesis Data Firehose delivery stream via two different approaches: through the AWS Management Console or through Terraform. We’ll start first with the console option.

To set up the Kinesis Data Firehose delivery stream on AWS, follow these steps:

  1. Go to the AWS Management Console and select Amazon Kinesis Data Firehose.

  2. Click on Create delivery stream.

  3. Choose a delivery stream name and select Direct PUT or other sources as the source.

create delivery stream

  4. Choose Elastic as the destination.

  5. In the Elastic destination section, enter the Elastic endpoint URL that you copied from your Elastic Cloud deployment.

destination settings

  6. Choose the content encoding and retry duration as shown above.

  7. Enter the appropriate parameter values for your AWS log type. For example, for VPC Flow logs, you would set the es_datastream_name parameter to logs-aws.vpcflow-default.

  8. Configure the Amazon S3 bucket as the backup for the Amazon Kinesis Data Firehose delivery stream (for failed data only or all data), and configure any required tags for the delivery stream.

  9. Review the settings and click on Create delivery stream.

In the example above, we are using the es_datastream_name parameter to pull in VPC Flow logs through the logs-aws.vpcflow-default datastream. Depending on your use case, this parameter can be configured with one of the following types of logs:

  • logs-aws.cloudfront_logs-default (AWS CloudFront logs)
  • logs-aws.ec2_logs-default (EC2 logs in AWS CloudWatch)
  • logs-aws.elb_logs-default (Amazon Elastic Load Balancing logs)
  • logs-aws.firewall_logs-default (AWS Network Firewall logs)
  • logs-aws.route53_public_logs-default (Amazon Route 53 public DNS queries logs)
  • logs-aws.route53_resolver_logs-default (Amazon Route 53 DNS queries & responses logs)
  • logs-aws.s3access-default (Amazon S3 server access log)
  • logs-aws.vpcflow-default (AWS VPC flow logs)
  • logs-aws.waf-default (AWS WAF Logs)
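If you would rather script the delivery stream with the AWS SDK, a hedged boto3 sketch follows; the endpoint URL, API key, role and bucket ARNs are placeholders, and the es_datastream_name parameter is passed as a common attribute of the HTTP endpoint destination:

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="elastic-vpcflow-stream",
    DeliveryStreamType="DirectPut",
    HttpEndpointDestinationConfiguration={
        "EndpointConfiguration": {
            "Url": "https://my-deployment.es.us-east-1.aws.found.io",  # placeholder Elasticsearch endpoint
            "Name": "ElasticCloudEndpoint",
            "AccessKey": "<elastic-api-key>",  # placeholder API key
        },
        "RequestConfiguration": {
            "ContentEncoding": "GZIP",
            "CommonAttributes": [
                {"AttributeName": "es_datastream_name", "AttributeValue": "logs-aws.vpcflow-default"}
            ],
        },
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "S3BackupMode": "FailedDataOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
            "BucketARN": "arn:aws:s3:::my-firehose-backup",             # placeholder
        },
    },
)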

Alternative approach using Terraform

Using the " aws_kinesis_firehose_delivery_stream" resource in Terraform is another way to create a Kinesis Firehose delivery stream, allowing you to specify the delivery stream name, data source, and destination - in this case, an Elasticsearch HTTP endpoint. To authenticate, you'll need to provide the endpoint URL and an API key. Leveraging this Terraform resource is a fantastic way to automate and streamline your deployment process, resulting in greater consistency and efficiency.

Here's an example code that shows you how to create a Kinesis Firehose delivery stream with Terraform that sends data to an Elasticsearch HTTP endpoint:

resource "aws_kinesis_firehose_delivery_stream" “Elasticcloud_stream" {
  name        = "terraform-kinesis-firehose-ElasticCloud-stream"
  destination = "http_endpoint”
  s3_configuration {
    role_arn           = aws_iam_role.firehose.arn
    bucket_arn         = aws_s3_bucket.bucket.arn
    buffer_size        = 5
    buffer_interval    = 300
    compression_format = "GZIP"
  }
  http_endpoint_configuration {
    url        = "https://cloud.elastic.co/"
    name       = “ElasticCloudEndpoint"
    access_key = “ElasticApi-key"
    buffering_hints {
      size_in_mb = 5
      interval_in_seconds = 300
    }

   role_arn       = "arn:Elastic_role"
   s3_backup_mode = "FailedDataOnly"
  }
}

Step 4: Configure VPC Flow Logs to send to Amazon Kinesis Data Firehose

To complete the setup, you'll need to configure VPC Flow logs in the VPC where your application is deployed and send them to the Amazon Kinesis Data Firehose delivery stream you set up in Step 3.

Enabling VPC flow logs in AWS is a straightforward process that involves several steps. Here is a step-by-step guide to enabling VPC flow logs in your AWS account:

  1. Select the VPC for which you want to enable flow logs.

  2. In the VPC dashboard, click on "Flow Logs" under the "Logs" section.

  3. Click on the "Create Flow Log" button to create a new flow log.

  4. In the "Create Flow Log" wizard, provide the following information:

  • Choose the target for your flow logs: in this case, Amazon Kinesis Data Firehose in the same AWS account.

  • Provide a name for your flow log.
  • Choose the VPC and the network interface(s) for which you want to enable flow logs.
  • Choose the flow log format: either AWS default or Custom format.
  5. Configure the IAM role for the flow logs. If you have an existing IAM role, select it. Otherwise, create a new IAM role that grants the necessary permissions for the flow logs.

  6. Review the flow log configuration and click "Create."

flow log settings

Create the VPC Flow log.
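If you script your AWS setup, the console steps above collapse into a single API call; here is a hedged boto3 sketch, with the VPC ID and delivery stream ARN as placeholders:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],  # placeholder VPC ID
    ResourceType="VPC",
    TrafficType="ALL",                      # capture both accepted and rejected traffic
    LogDestinationType="kinesis-data-firehose",
    LogDestination="arn:aws:firehose:us-east-1:123456789012:deliverystream/elastic-vpcflow-stream",  # placeholder
)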

Step 5: After a few minutes, check if flows are coming into Elastic

To confirm that the VPC Flow logs are ingesting into Elastic, you can check the logs in Kibana. You can do this by searching for the index in the Kibana Discover tab and filtering the results by the appropriate index and time range. If VPC Flow logs are flowing in, you should see a list of documents representing the VPC Flow logs.

expanded document

Step 6: Navigate to Kibana to see your logs parsed and visualized in the [Logs AWS] VPC Flow Log Overview dashboard

Finally, there is an Elastic out-of-the-box (OOTB) VPC Flow logs dashboard that displays the top IP addresses that are hitting your VPC, their geographic location, time series of the flows, and a summary of VPC flow log rejects within the selected time frame. This dashboard can provide valuable insights into your network traffic and potential security threats.

vpc flow log map

Note: For additional VPC flow log analysis capabilities, please refer to this blog.

Step 7: Configure AWS Network Firewall Logs to send to Kinesis Firehose

To create a Kinesis Data Firehose delivery stream for AWS Network Firewall logs, first log in to the AWS Management Console, navigate to the Kinesis service, select "Data Firehose", and follow the step-by-step instructions as shown in Step 3. Specify the Elasticsearch endpoint and API key, add the parameter es_datastream_name=logs-aws.firewall_logs-default, and create the delivery stream.

Second, to set up a Network Firewall rule group to send logs to the Kinesis Firehose, go to the Network Firewall section of the console, create a rule group, add a rule to allow traffic to the Kinesis endpoint, and attach the rule group to your Network Firewall configuration. Finally, test the configuration by sending traffic through the Network Firewall to the Kinesis Firehose endpoint and verify that logs are being delivered to your S3 bucket.

Kindly follow the instructions below to set up a firewall rule and logging.

  1. Set up a Network Firewall rule group to send logs to Amazon Kinesis Data Firehose:
  • Go to the AWS Management Console and select Network Firewall.
  • Click on "Rule groups" in the left menu and then click "Create rule group."
  • Choose "Stateless" or "Stateful" depending on your requirements, and give your rule group a name. Click "Create rule group."
  • Add a rule to the rule group to allow traffic to the Kinesis Firehose endpoint. For example, if you are using the us-east-1 region, you would add a rule like this:
{
  "RuleDefinition": {
    "Actions": [
      {
        "Type": "AWS::KinesisFirehose::DeliveryStream",
        "Options": {
          "DeliveryStreamArn": "arn:aws:firehose:us-east-1:12387389012:deliverystream/my-delivery-stream"
        }
      }
    ],
    "MatchAttributes": {
      "Destination": {
        "Addresses": ["api.firehose.us-east-1.amazonaws.com"]
      },
      "Protocol": {
        "Numeric": 6,
        "Type": "TCP"
      },
      "PortRanges": [
        {
          "From": 443,
          "To": 443
        }
      ]
    }
  },
  "RuleOptions": {
    "CustomTCPStarter": {
      "Enabled": true,
      "PortNumber": 443
    }
  }
}
  • Save the rule group.
  2. Attach the rule group to your Network Firewall configuration:
  • Go to the AWS Management Console and select Network Firewall.
  • Click on "Firewall configurations" in the left menu and select the configuration you want to attach the rule group to.
  • Scroll down to "Associations" and click "Edit."
  • Select the rule group you created in Step 2 and click "Save."
  3. Test the configuration:
  • Send traffic through the Network Firewall to the Kinesis Firehose endpoint and verify that logs are being delivered to your S3 bucket.

Step 8: Navigate to Kibana to see your logs parsed and visualized in the [Logs AWS] Firewall Log dashboard

firewall log dashboard

Wrapping up

We’re excited to bring you this latest integration for AWS Cloud and Kinesis Data Firehose into production. The ability to consolidate logs and metrics to gain visibility across your cloud and on-premises environment is crucial for today’s distributed environments and applications.

From EC2, Cloudwatch, Lambda, ECS and SAR, Elastic Integrations allow you to quickly and easily get started with ingesting your telemetry data for monitoring, analytics, and observability. Elastic is constantly delivering frictionless customer experiences, allowing anytime, anywhere access to all of your telemetry data — this streamlined, native integration with AWS is the latest example of our commitment.

Start a free trial today

You can begin with a 7-day free trial of Elastic Cloud within the AWS Marketplace to start monitoring and improving your users' experience today!

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

AWS VPC Flow log analysis with GenAI in Elastic
https://www.elastic.co/observability-labs/blog/aws-vpc-flow-log-analysis-with-genai-elastic
Fri, 07 Jun 2024

Elastic Observability provides a full observability solution by supporting metrics, traces, and logs for applications and infrastructure. In managing AWS deployments, VPC flow logs are critical for performance, network visibility, security, compliance, and overall management of your AWS environment. Several examples of the insight they provide:

  1. Where traffic is coming from and going to, both into and out of the deployment and within it. This helps identify unusual or unauthorized communications.

  2. Traffic volumes, detecting spikes or drops that could indicate service issues in production or an increase in customer traffic.

  3. Latency and performance bottlenecks: with VPC Flow logs, you can look at latency for a flow (inbound and outbound) and understand patterns.

  4. Accepted and rejected traffic, which helps determine where potential security threats and misconfigurations lie.

AWS VPC Flow Logs are a good illustration of both the value and the challenge of logging. Logging is an important part of observability, alongside metrics and tracing. The volume of logs that an application and its underlying infrastructure produce can be daunting with VPC Flow Logs, but they also provide a significant amount of insight.

Before we proceed, it is important to understand what Elastic provides in managing AWS and VPC Flow logs:

  1. A full set of integrations to manage VPC Flows and the entire end-to-end deployment on AWS

  2. Elastic has a simple-to-use AWS Firehose integration

  3. Elastic’s tools such as Discover, spike analysis, and anomaly detection help provide you with better insights and analysis.

  4. And a set of simple Out-of-the-box dashboards

In today’s blog, we’ll cover how Elastic’s other features can make analyzing VPC flow logs and performing root cause analysis (RCA) even easier. Specifically, we will focus on managing the number of rejects, as this helps ensure there weren’t any unauthorized or unusual activities:

  1. Set up an easy-to-use SLO (newly released) to detect when things are potentially degrading

  2. Create an ML job to analyze different fields of the VPC Flow log

  3. Using our newly released RAG-based AI Assistant to help analyze the logs without needing to know Elastic’s query language or even how to build graphs in Elastic

  4. Using ES|QL to help understand and analyze latency patterns

In subsequent blogs, we will use the AI Assistant and ES|QL to show how to get other insights beyond just REJECT/ACCEPT from VPC Flow logs.

Prerequisites and config

If you plan on following this blog, here are some of the components and details we used to set up this demonstration:

SLO with VPC Flow Logs

Elastic’s SLO capability is based directly on the Google SRE Handbook. All the definitions and semantics are utilized as described in Google’s SRE handbook. Hence users can perform the following on SLOs in Elastic:

  • Define an SLO on Logs not just metrics - Users can use KQL (log-based query), service availability, service latency, custom metric, histogram metric, or a timeslice metric.
  • Define SLO, SLI, Error budget and burn rates. Users can also use occurrence versus time slice-based budgeting. 
  • Manage, with dashboards, all the SLOs in a singular location.
  • Trigger alerts from the defined SLO, whether the SLI is off, the burn rate is used up, or the error rate is X.

Setting up an SLO for VPC is easy. You simply create a query you want to trigger off. In our case, we look for all the good events where aws.vpcflow.action=ACCEPT and we define the target at 85%. 

Setting up SLO for VPC Flow log

As the following example shows, over the last 7 days we have exceeded our budget by 43%. Additionally, we have been out of compliance for the last 7 days.

VPC Flow Reject SLO

Analyzing the SLO with AI Assistant

Now that we see that there is an issue with the VPC Flows, we immediately work with the AI Assistant to start analyzing the SLO. Because it's a chat interface we simply open the AI Assistant and work through some simple analysis: (See Animated GIF for a demo below)

AI Assistant analysis:

  • what were the top 3 source.address that had aws.vpcflow.action=REJECT over the last 7 days, which is causing this SLO issue? - We wanted to simply see what could be causing the loss in error budget. Were there any particular source.addresses causing a heavy reject rate.

    • The answer: A table with the highest count = 42670 and source.address = 79.110.62.185

    • There is one singular source.address that is causing the loss in SLO. 

  • What is the largest number of aws.vpcflow.action=REJECT in a 30 min time frame for the last 3 days where the source.address=79.110.62.185? - After understanding that a specific source.address is causing the loss in SLO, we want to understand how the rejects are distributed over time.

    • The answer: “The largest number of aws.vpcflow.action=REJECT in a 30-minute time frame for the last 3 days where the source.address is 79.110.62.185 is 229. This occurred on 2024-06-01T04:00:00.000Z.”

    • This means the REJECT rate must be low but fairly consistent, rather than spiky, over the last 7 days.

  • for the logs with source.address="79.110.62.185" was there any country code of source.geo.country_iso_code field present. If yes what is the value - Given the last question showed a low REJECT rate, it only means that this was fairly consistent vs spiky over the last 7 days.

    • The answer: Yes, there is a country code present in the source.geo.country_iso_code field for logs with source.address="79.110.62.185". The value is BG (Bulgaria).
  • Is there a specific destination.address where source.address=79.110.62.185 is getting a aws.vpcflow.action=REJECT. Give me both the destination.address and the number of REJECTs for that destination.address?

    • The answer: destination.address of 10.0.0.27 is giving a reject number of 53433 in this time frame.
  • Graph the number of REJECT vs ACCEPT for source.address="79.110.62.185" over the last 7 days. The graph is on a daily basis in a singular graph - We asked this question to see what the comparison is between ACCEPT and REJECT. 

    • The answer: See the animated GIF to see that the generated graph is fairly stable
  • Were there any source.address values that had a spike, a high reject rate, in a 30 min period over the last 30 days? - We wanted to see if there were any other spikes.

    • The answer - Yes, there was a source.address that had a spike in high reject rates in a 30-minute period over the last 30 days. source.address: 185.244.212.67, Reject Count: 8975, Time Period: 2024-05-22T03:00:00.000Z

Watch the flow

<Video vidyardUuid="1jvEpzfkci9j6AoL42XWA3" />

Potential issue:

The server handling requests from source 79.110.62.185 is potentially having an issue.

Again using logs, we essentially asked the AI Assistant to give us the ENI IDs where the internal IP address was 10.0.0.27.

Finding the issue - webserver

From our AWS console, we know that this is the webserver. After further analysis in Elastic, and together with the developers, we realized that a recently installed new version was causing a problem with connections.

Locating anomalies with ML

While using the AI Assistant is great for analyzing information, another important aspect of VPC flow management is to ensure you can manage log spikes and anomalies. Elastic has a machine learning platform that allows you to develop jobs to analyze specific metrics or multiple metrics to look for anomalies.

VPC Flow logs come with a large amount of information. The full set of fields is listed in AWS docs. We will use a specific subset to help detect anomalies.

We set up anomaly detection for aws.vpcflow.action=REJECT, which requires us to use multi-metric anomaly detection in Elastic.

The config we used utilizes:

Detectors:

  • destination.address

  • destination.port

Influencers:

  • source.address

  • aws.vpcflow.action

  • destination.geo.region_iso_code

The way we set this up will help us understand if there is a large spike in REJECT/ACCEPT against destination.address values from a specific source.address and/or destination.geo.region_iso_code location.

Anomaly detection job config
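As a hedged sketch of roughly what such a job looks like through the API (the job ID, bucket span, and connection details are assumptions; the multi-metric wizard in Kibana generates an equivalent configuration and the accompanying datafeed):

from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")  # placeholder connection details

# Count-based detectors split per destination, with the influencers listed above
es.ml.put_job(
    job_id="vpc-flow-reject-anomalies",
    analysis_config={
        "bucket_span": "30m",
        "detectors": [
            {"function": "high_count", "partition_field_name": "destination.address"},
            {"function": "high_count", "partition_field_name": "destination.port"},
        ],
        "influencers": ["source.address", "aws.vpcflow.action", "destination.geo.region_iso_code"],
    },
    data_description={"time_field": "@timestamp"},
)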

Once run, the job reveals something interesting:

Anomaly detected

Notice that source.address 185.244.212.67 has had a high REJECT rate in the last 30 days. 

Notice where we found this before? In the AI Assistant!

While we can run the AI Assistant and find this sort of anomaly, the ML job can be set up to run continuously and alert us on such spikes. This will help us understand if there are any issues with the webserver, like the one we found above, or even potential security attacks.

Conclusion:

You’ve now seen how easily Elastic’s RAG-based AI Assistant can help analyze VPC Flow logs without needing to know query syntax, where the data lives, or even the field names. Additionally, you’ve seen how Elastic can alert you when there is a potential issue or degradation in service (SLO). Check out our other blogs on AWS VPC Flow analysis in Elastic:

  1. A full set of integrations to manage VPC Flows and the entire end-to-end deployment on AWS

  2. Elastic has a simple-to-use AWS Firehose integration

  3. Elastic’s tools such as Discover, spike analysis,  and anomaly detection help provide you with better insights and analysis.

  4. And a set of simple Out-of-the-box dashboards

Try it out

Existing Elastic Cloud customers can access many of these features directly from the Elastic Cloud console. Not taking advantage of Elastic on the cloud? Start a free trial.

All of this is also possible in your environment. Learn how to get started today.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.

Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.

]]>
<![CDATA[Best Practices for Log Management: Leveraging Logs for Faster Problem Resolution]]> https://www.elastic.co/observability-labs/blog/best-practices-logging best-practices-logging Wed, 11 Sep 2024 00:00:00 GMT In today's rapid software development landscape, efficient log management is crucial for maintaining system reliability and performance. With expanding and complex infrastructure and application components, the responsibilities of operations and development teams are ever-growing and multifaceted. This blog post outlines best practices for effective log management, addressing the challenges of growing data volumes, complex infrastructures, and the need for quick problem resolution.

Understanding Logs and Their Importance

Logs are records of events occurring within your infrastructure, typically including a timestamp, a message detailing the event, and metadata identifying the source. They are invaluable for diagnosing issues, providing early warnings, and speeding up problem resolution. Logs are often the primary signal that developers enable, offering significant detail for debugging, performance analysis, security, and compliance management.

The Logging Journey

The logging journey involves three basic steps: collection and ingestion, processing and enrichment, and analysis and rationalization. Let's explore each step in detail, covering some of the best practices for each section.

Logging Journey

1. Log Collection and Ingestion

Collect Everything Relevant and Actionable

The first step is to collect all logs into a central location. This involves identifying all your applications and systems and collecting their logs. Comprehensive data collection ensures no critical information is missed, providing a complete picture of your system's behavior. In the event of an incident, having all logs in one place can significantly reduce the time to resolution. It's generally better to collect more data than you need: you can always filter out irrelevant information later, and you can shorten the retention of logs that turn out to be unnecessary.

Leverage Integrations

Elastic provides over 300 integrations that simplify data onboarding. These integrations not only collect data but also come with dashboards, saved searches, and pipelines to parse the data. Utilizing these integrations can significantly reduce manual effort and ensure data consistency.

Consider Ingestion Capacity and Costs

An important aspect of log collection is ensuring you have sufficient ingestion capacity at a manageable cost. When assessing solutions, be cautious about those that charge significantly more for high cardinality data, as this can lead to unexpectedly high costs in observability solutions. We'll talk more about cost effective log management later in this post.

Use Kafka for Large Projects

For larger organizations, implementing Kafka can improve log data management. Kafka acts as a buffer, making the system more reliable and easier to manage. It allows different teams to send data to a centralized location, which can then be ingested into Elastic.
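As a rough sketch of what that buffering layer can look like, the Logstash pipeline below consumes JSON application logs from a Kafka topic and forwards them to Elasticsearch. The broker addresses, topic name, endpoint, and credentials are placeholders; your topology will differ.

input {
  kafka {
    bootstrap_servers => "kafka-1:9092,kafka-2:9092"
    topics => ["application-logs"]
    group_id => "logstash-log-ingest"
    codec => "json"
  }
}
output {
  elasticsearch {
    hosts => ["https://my-deployment.es.example.com:9243"]
    data_stream => true
    user => "username"
    password => "password"
  }
}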

2. Processing and Enrichment

Adopt Elastic Common Schema (ECS)

One key aspect of log collection is to normalize data as consistently as possible across all of your applications and infrastructure. Having a common semantic schema is crucial. Elastic contributed the Elastic Common Schema (ECS) to OpenTelemetry (OTel), helping accelerate the adoption of OTel-based observability and security. This move towards a more normalized way to define and ingest logs (as well as metrics and traces) is beneficial for the industry.

Using ECS helps standardize field names and data structures, making data analysis and correlation easier. This common schema ensures your data is organized predictably, facilitating more efficient querying and reporting. Learn more about ECS here.
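To make this concrete, here is an illustrative example of a single web-access event expressed with ECS field names; the exact fields you populate depend on your source, but the point is that every data source uses the same names for the same concepts.

{
  "@timestamp": "2024-09-11T10:00:00.000Z",
  "message": "10.42.42.42 - - [11/Sep/2024:10:00:00 +0000] \"GET /products HTTP/1.1\" 200 1024",
  "event": { "dataset": "nginx.access", "category": ["web"] },
  "source": { "ip": "10.42.42.42" },
  "url": { "path": "/products" },
  "http": {
    "request": { "method": "GET" },
    "response": { "status_code": 200, "body": { "bytes": 1024 } }
  }
}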

Optimize Mappings for High Volume Data

For high cardinality fields or those rarely used, consider optimizing or removing them from the index. This can improve performance by reducing the amount of data that needs to be indexed and searched. Our documentation has sections to tune your setup for disk usage, search speed and indexing speed.
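As a hedged example, a component template like the one below keeps a noisy, rarely queried field in _source while excluding it from both the inverted index and doc values. The field name is hypothetical; apply this only to fields you are confident you will not search or aggregate on.

PUT _component_template/logs-myapp-mapping-tuning
{
  "template": {
    "mappings": {
      "properties": {
        "myapp": {
          "properties": {
            "debug_payload": {
              "type": "keyword",
              "index": false,
              "doc_values": false
            }
          }
        }
      }
    }
  }
}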

Managing Structured vs. Unstructured Logs

Structured logs are generally preferable as they offer more value and are easier to work with. They have a predefined format and fields, simplifying information extraction and analysis. For custom logs without pre-built integrations, you may need to define your own parsing rules.

For unstructured logs, full-text search capabilities can help mitigate limitations. By indexing logs, full-text search allows users to search for specific keywords or phrases efficiently, even within large volumes of unstructured data. This is one of the main differentiators of Elastic's observability solution. You can simply search for any keyword or phrase and get results in real-time, without needing to write complex regular expressions or parsing rules at query time.

Schema-on-Read vs. Schema-on-Write

There are two main approaches to processing log data:

  1. Schema-on-read: Some observability dashboarding capabilities can perform runtime transformations to extract fields from non-parsed sources on the fly. This is helpful when dealing with legacy systems or custom applications that may not log data in a standardized format. However, runtime parsing can be time-consuming and resource-intensive, especially for large volumes of data.

  2. Schema-on-write: This approach offers better performance and more control over the data. The schema is defined upfront, and the data is structured and validated at the time of writing. This allows for faster processing and analysis of the data, which is beneficial for enrichment. A minimal ingest-pipeline sketch of this approach follows below.
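Here is a minimal sketch of the schema-on-write approach: an ingest pipeline that parses a hypothetical "LEVEL TIMESTAMP MESSAGE" log line into fields at write time. The pipeline name and line format are assumptions for illustration.

PUT _ingest/pipeline/parse-myapp-logs
{
  "description": "Illustrative: parse 'INFO 2024-09-11T10:00:00Z something happened' style lines at ingest time",
  "processors": [
    {
      "dissect": {
        "field": "message",
        "pattern": "%{log.level} %{_tmp.timestamp} %{message}"
      }
    },
    {
      "date": {
        "field": "_tmp.timestamp",
        "formats": ["ISO8601"]
      }
    },
    {
      "remove": {
        "field": "_tmp.timestamp"
      }
    }
  ]
}

With schema-on-read, the same fields could instead be computed at query time (for example with runtime fields), at the cost of repeating that parsing work on every search.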

3. Analysis and Rationalization

Full-Text Search

Elastic's full-text search capabilities, powered by Elasticsearch, allow you to quickly find relevant logs. The Kibana Query Language (KQL) enhances search efficiency, enabling you to filter and drill down into the data to identify issues rapidly.

Here are a few examples of KQL queries:

// Filter documents where a field exists
http.request.method: *

// Filter documents that match a specific value
http.request.method: GET

// Search all fields for a specific value
Hello

// Filter documents where a text field contains specific terms
http.request.body.content: "null pointer"

// Filter documents within a range
http.response.bytes < 10000

// Combine range queries
http.response.bytes > 10000 and http.response.bytes <= 20000

// Use wildcards to match patterns
http.response.status_code: 4*

// Negate a query
not http.request.method: GET

// Combine multiple queries with AND/OR
http.request.method: GET and http.response.status_code: 400

Machine Learning Integration

Machine learning can automate the detection of anomalies and patterns within your log data. Elastic offers features like log rate analysis that automatically identify deviations from normal behavior. By leveraging machine learning, you can proactively address potential issues before they escalate.

Machine Learning

It is recommended that organizations utilize a diverse arsenal of machine learning algorithms and techniques to effectively uncover unknown-unknowns in log files. Unsupervised machine learning algorithms should be employed for anomaly detection on real-time data, with rate-controlled alerting based on severity.

By automatically identifying influencers, users can gain valuable context for automated root cause analysis (RCA). Log pattern analysis brings categorization to unstructured logs, while log rate analysis and change point detection help identify the root causes of spikes in log data.

Take a look at the documentation to get started with machine learning in Elastic.

Dashboarding and Alerting

Building dashboards and setting up alerting helps you monitor your logs in real-time. Dashboards provide a visual representation of your logs, making it easier to identify patterns and anomalies. Alerting can notify you when specific events occur, allowing you to take action quickly.

Cost-Effective Log Management

Use Data Tiers

Implementing index lifecycle management to move data across hot, warm, cold, and frozen tiers can significantly reduce storage costs. This approach ensures that only the most frequently accessed data is stored on expensive, high-performance storage, while older data is moved to more cost-effective storage solutions.

ILM

Our documentation explains how to set up Index Lifecycle Management.
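As a starting point, the sketch below defines a policy that rolls data over in the hot phase and steps it down through warm and cold before deleting it after 90 days. The phase timings and actions are illustrative defaults, not a recommendation for every workload.

PUT _ilm/policy/logs-default-retention
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "7d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": { "readonly": {} }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}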

Compression and Index Sorting

Applying best compression settings and using index sorting can further reduce the data footprint. Optimizing the way data is stored on disk can lead to substantial savings in storage costs and improve retrieval performance. As of 8.15, Elasticsearch provides an indexing mode called "logsdb". This is a highly optimized way of storing log data. This new way of indexing data uses 2.5 times less disk space than the default mode. You can read more about it here. This mode automatically applies the best combination of settings for compression, index sorting, and other optimizations that weren't accessible to users before.
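If you are on 8.15 or later, opting a data stream into logsdb mode can be as simple as one index setting in its template. A minimal sketch, with the template name and pattern as placeholders:

PUT _index_template/logs-myapp
{
  "index_patterns": ["logs-myapp.*-*"],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": {
      "index.mode": "logsdb"
    }
  }
}

On earlier versions you can get part of the way there manually, for example by setting index.codec to best_compression and defining index.sort.field in your own templates.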

Snapshot Lifecycle Management (SLM)

SLM

SLM allows you to back up your data and delete it from the main cluster, freeing up resources. If needed, data can be restored quickly for analysis, ensuring that you maintain the ability to investigate historical events without incurring high storage costs.
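A hedged sketch of a nightly SLM policy is shown below; it assumes a snapshot repository named my_snapshot_repository has already been registered, and the schedule and retention values are placeholders.

PUT _slm/policy/nightly-logs-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<logs-snap-{now/d}>",
  "repository": "my_snapshot_repository",
  "config": {
    "indices": ["logs-*"],
    "include_global_state": false
  },
  "retention": {
    "expire_after": "90d",
    "min_count": 5,
    "max_count": 100
  }
}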

Learn more about SLM in the documentation.

Dealing with Large Amounts of Log Data

Managing large volumes of log data can be challenging. Here are some strategies to optimize log management:

  1. Develop a logs deletion policy. Evaluate what data to collect and when to delete it.
  2. Consider discarding DEBUG logs or even INFO logs earlier, and delete dev and staging environment logs sooner.
  3. Aggregate short windows of identical log lines, which is especially useful for TCP security event logging.
  4. For applications and code you control, consider moving some logs into traces to reduce log volume while maintaining detailed information.

Centralized vs. Decentralized Log Storage

Data locality is an important consideration when managing log data. The costs of ingressing and egressing large amounts of log data can be prohibitively high, especially when dealing with cloud providers.

In the absence of regional redundancy requirements, your organization may not need to send all log data to a central location. Consider keeping log data local to the datacenter where it was generated to reduce ingress and egress costs.

Cross-cluster search functionality enables users to search across multiple logging clusters simultaneously, reducing the amount of data that needs to be transferred over the network.
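A cross-cluster search looks almost identical to a local one; you simply prefix the index pattern with the remote cluster alias. The aliases below are placeholders for whatever you configured in your remote cluster settings.

GET logs-*,cluster_eu:logs-*,cluster_us:logs-*/_search
{
  "query": {
    "match_phrase": { "message": "null pointer" }
  }
}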

Cross-cluster replication is useful for maintaining business continuity in the event of a disaster, ensuring data availability even during an outage in one datacenter.

Monitoring and Performance

Monitor Your Log Management System

Using a dedicated monitoring cluster can help you track the performance of your Elastic deployment. Stack monitoring provides metrics on search and indexing activity, helping you identify and resolve performance bottlenecks.

Adjust Bulk Size and Refresh Interval

Optimizing these settings can balance performance and resource usage. Increasing bulk size and refresh interval can improve indexing efficiency, especially for high-throughput environments.
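For example, relaxing the refresh interval on a write-heavy logs index trades a little search freshness for cheaper indexing; the index name below is a placeholder, and bulk size itself is configured on the ingestion client rather than in Elasticsearch.

PUT my-logs-index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}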

Logging Best Practices

Adjust Log Levels

Ensure that log levels are appropriately set for all applications. Customize log formats to facilitate easier ingestion and analysis. Properly configured log levels can reduce noise and make it easier to identify critical issues.

Use Modern Logging Frameworks

Implement logging frameworks that support structured logging. Adding metadata to logs enhances their usefulness for analysis. Structured logging formats, such as JSON, allow logs to be easily parsed and queried, improving the efficiency of log analysis. If you fully control the application and are already using structured logging, consider using Elastic's version of these libraries, which can automatically parse logs into ECS fields.

Leverage APM and Metrics

For custom-built applications, Application Performance Monitoring (APM) provides deeper insights into application performance, complementing traditional logging. APM tracks transactions across services, helping you understand dependencies and identify performance bottlenecks.

APM

Consider collecting metrics alongside logs. Metrics can provide insights into your system's performance, such as CPU usage, memory usage, and network traffic. If you're already collecting logs from your systems, adding metrics collection is usually a quick process.

Traces can provide even deeper insights into specific transactions or request paths, especially in cloud-native environments. They offer more contextual information and excel at tracking dependencies across services. However, implementing tracing is only possible for applications you own, and not all developers have fully embraced it yet.

A combined logging and tracing strategy is recommended, where traces provide coverage for newer instrumented apps, and logging supports legacy applications and systems you don't own the source code for.

Conclusion

Effective log management is essential for maintaining system reliability and performance in today's complex software environments. By following these best practices, you can optimize your log management process, reduce costs, and improve problem resolution times.

Key takeaways include:

  • Ensure comprehensive log collection with a focus on normalization and common schemas.
  • Use appropriate processing and enrichment techniques, balancing between structured and unstructured logs.
  • Leverage full-text search and machine learning for efficient log analysis.
  • Implement cost-effective storage strategies and smart data retention policies.
  • Enhance your logging strategy with APM, metrics, and traces for a complete observability solution.

Continuously evaluate and adjust your strategies to keep pace with the growing volume and complexity of log data, and you'll be well-equipped to ensure the reliability, performance, and security of your applications and infrastructure.

Check out our other blogs:

Ready to get started? Use Elastic Observability on Elastic Cloud — the hosted Elasticsearch service that includes all of the latest features.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

]]>
<![CDATA[Bringing Your Cloud-Managed Kubernetes Audit Logs into Elasticsearch]]> https://www.elastic.co/observability-labs/blog/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch Mon, 19 Aug 2024 00:00:00 GMT Introduction:

Kubernetes audit logs are essential for ensuring the security, compliance, and transparency of Kubernetes clusters. However, with managed Kubernetes infrastructure, traditional audit file-based log shipping is often not supported, and audit logs are only available via the control plane API or the Cloud Provider logging facility. In this blog, we will show you how to ingest the audit logs from these other sources and still take advantage of the Elastic Kubernetes Audit Log Integration.

In this blog, we will focus on AWS as our cloud provider; when ingesting logs from AWS, you have several options.

In part 1 of this two-part series, we will focus on properly ingesting Kubernetes audit logs, and part 2 will focus on investigation, analytics, and alerting.

Kubernetes auditing documentation describes the need for auditing in order to get answers to the questions below:

  • What happened?
  • When did it happen?
  • Who initiated it?
  • What resource did it occur on?
  • Where was it observed?
  • From where was it initiated (Source IP)?
  • Where was it going (Destination IP)?

Answers to the above questions become important when an incident occurs and an investigation follows. Alternatively, it could just be a log retention use case for a regulated company trying to fulfill compliance requirements. 

We are giving special importance to audit logs in Kubernetes because audit logs are not enabled by default. Audit logs can take up a large amount of memory and storage. So, usually, it’s a balance between retaining/investigating audit logs and giving up resources otherwise budgeted for workloads hosted on the Kubernetes cluster. Another reason we’re talking about audit logs in Kubernetes is that, unlike usual container logs, once turned on, these logs are orchestrated to be written to the cloud provider’s logging service. This is true for most cloud providers because the Kubernetes control plane is managed by the cloud provider, and it makes sense for cloud providers to use their built-in orchestration workflows involving the control plane for a managed service backed by their implementation of a logging framework.

Kubernetes audit logs can be quite verbose by default. Hence, it becomes important to selectively choose how much logging needs to be done so that all of the organization’s audit requirements are met. This is done in the audit policy file, which is submitted to the kube-apiserver. Not all flavors of cloud-provider-hosted Kubernetes clusters allow you to configure the kube-apiserver directly; for example, AWS EKS only allows this logging to be configured through the control plane.

In this blog, we will be using Amazon Elastic Kubernetes Service (Amazon EKS), with the Kubernetes audit logs automatically shipped to AWS CloudWatch.

A sample audit log for a secret by the name “empty-secret” created by an admin user on EKS  is logged on AWS CloudWatch in the following format: 

Alt text

Once the audit logs show up on CloudWatch, it is time to consider how to transfer them to Elasticsearch. Elasticsearch is a great platform for creating dashboards that visualize different audit events recorded in a Kubernetes cluster. It is also a powerful tool for analyzing various audit events. For example, how many secret object creation attempts were made in an hour? 

Now that we have established that the Kubernetes audit logs are being logged in CloudWatch, let’s discuss how to get the logs ingested into Elasticsearch. Elasticsearch has an integration to consume logs written to CloudWatch. Using this integration with its defaults will ingest the JSON from CloudWatch as is, i.e., the real audit log JSON stays nested inside the wrapper CloudWatch JSON. When bringing logs into Elasticsearch, it is important to use the Elastic Common Schema (ECS) to get the best search and analytics performance. This means there needs to be an ingest pipeline that parses a standard Kubernetes audit JSON message and creates an ECS-compliant document in Elasticsearch. Let’s dive into how to achieve this.

Elasticsearch has a Kubernetes integration that uses Elastic Agent to consume Kubernetes container logs from the console and audit logs written to a file path. For a cloud-provider use case, as described above, it may not be feasible to write audit logs to a path on the Kubernetes cluster. So, how do we leverage the ECS parsing for Kubernetes audit logs that is already implemented in the Kubernetes integration and apply it to the CloudWatch audit logs? That is the most exciting plumbing piece! Let’s see how to do it.

What we’re going to do is:

  • Read the Kubernetes audit logs from the cloud provider’s logging module, in our case AWS CloudWatch, since this is where the logs reside. We will use Elastic Agent and the Elasticsearch AWS Custom Logs integration to read the logs from CloudWatch. Note: please be aware that there are several Elastic AWS integrations; we are specifically using the AWS Custom Logs integration.

  • Create two simple ingest pipelines (we do this for best practices of isolation and composability) 

  • The first pipeline looks for Kubernetes audit JSON messages and then redirects them to the second pipeline

  • The second custom pipeline will map the JSON message field to the field expected by the managed Elasticsearch Kubernetes Audit pipeline (aka the integration) and then reroute the message to the correct data stream, kubernetes.audit_logs-default, which in turn applies all the proper mappings and ingest pipelines to the incoming message

  • The overall flow will be

Alt text

1. Create an AWS CloudWatch integration:

a.  Populate the AWS access key and secret pair values

Alt text

b. In the logs section, populate the log ARN and tags, enable Preserve the original event if you want to, and then save the integration and exit the page

Alt text

2. Next, we will configure the custom ingest pipeline

We are doing this because we want to override what the generic managed pipeline does. We find the custom component name by locating the managed pipeline created as an asset when we install the AWS CloudWatch integration. In this case, we will be adding the custom ingest pipeline logs-aws_logs.generic@custom.

Alt text

From the Dev Tools console, run the two requests below. Here, we extract the message field from the CloudWatch JSON and put the value in a field called kubernetes.audit. Then, we reroute the message to the default Kubernetes audit data set (and its ECS mappings) that comes with the Kubernetes integration.

PUT _ingest/pipeline/logs-aws_logs.generic@custom
{
  "processors": [
    {
      "pipeline": {
        "if": "ctx.message.contains('audit.k8s.io')",
        "name": "logs-aws-process-k8s-audit"
      }
    }
  ]
}

PUT _ingest/pipeline/logs-aws-process-k8s-audit
{
  "processors": [
    {
      "json": {
        "field": "message",
        "target_field": "kubernetes.audit"
      }
    },
    {
      "remove": {
        "field": "message"
      }
    },
    {
      "reroute": {
        "dataset": "kubernetes.audit_logs",
        "namespace": "default"
      }
    }
  ]
}

Let’s understand this further:

  • When we create a Kubernetes integration, we get a managed index template called logs-kubernetes.audit_logs that writes to the pipeline called logs-kubernetes.audit_logs-1.62.2 by default

  • If we look into the pipeline logs-kubernetes.audit_logs-1.62.2, we see that all the processor logic is working against the field kubernetes.audit. This is the reason why our json processor in the above code snippet is creating a field called kubernetes.audit before dropping the original message field and rerouting. Rerouting is directed to the kubernetes.audit_logs dataset that backs the logs-kubernetes.audit_logs-1.62.2 pipeline (dataset name is derived from the pipeline name convention that’s in the format logs-<datasetname>-version)

Alt text

3. Now let’s verify that the logs are actually flowing through and the audit message is being parsed

a. We will use Elastic Agent and enroll using Fleet and the integration policy we created in the Step 1. There are a number of ways to deploy Elastic Agent and for this exercise we will deploy using docker which is quick and easy.

% docker run --env FLEET_ENROLL=1 --env FLEET_URL=<<fleet_URL>> --env FLEET_ENROLLMENT_TOKEN=<<fleet_enrollment_token>>  --rm docker.elastic.co/beats/elastic-agent:8.16.1

b. Check the messages in Discover. In 8.15 there is also a new feature called Logs Explorer, which lets you see Kubernetes audit logs (and container logs) with a few clicks (see image below). Voila! We can see the Kubernetes audit messages parsed!

Alt text
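If you prefer to query rather than click, a couple of KQL filters in Discover can confirm the reroute worked. The kubernetes.audit.* field below reflects how the integration maps the Kubernetes audit schema and may vary slightly between integration versions.

// All rerouted Kubernetes audit events
data_stream.dataset : "kubernetes.audit_logs"

// Audit events that touched secrets
data_stream.dataset : "kubernetes.audit_logs" and kubernetes.audit.objectRef.resource : "secrets"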

4. Let's do a quick recap of what we did

We configured the CloudWatch integration in Elasticsearch to read Kubernetes audit logs from CloudWatch. Then, we created custom ingest pipelines to reroute the audit messages to the correct data stream, which applies all the OOTB mappings and parsing that come with the Kubernetes Audit Logs integration.

In the next part, we’ll look at how to analyze the ingested Kubernetes Audit log data.

]]>
<![CDATA[Customize your data ingestion with Elastic input packages]]> https://www.elastic.co/observability-labs/blog/customize-data-ingestion-input-packages customize-data-ingestion-input-packages Tue, 26 Sep 2023 00:00:00 GMT Elastic<sup>®</sup> has enabled the collection, transformation, and analysis of data flowing between the external data sources and Elastic Observability Solution through integrations. Integration packages achieve this by encapsulating several components, including agent configuration, inputs for data collection, and assets like ingest pipelines, data streams, index templates, and visualizations. The breadth of these assets supported in the Elastic Stack increases day by day.

This blog dives into how input packages provide an extremely generic and flexible solution to the advanced users for customizing their ingestion experience in Elastic.

What are input packages?

An Elastic Package is an artifact that contains a collection of assets that extend the Elastic Stack, providing new capabilities to accomplish a specific task like integration with an external data source. The first use of Elastic packages is integration packages, which provide an end-to-end experience — from configuring Elastic Agent, to collecting signals from the data source, to ingesting them correctly and using the data once ingested.

However, advanced users may need to customize data collection, either because an integration does not exist for a specific data source, or even if it does, they want to collect additional signals or in a different way. Input packages are another type of Elastic package that provides the capability to configure Elastic Agent to use the provided inputs in a custom way.

Let’s look at an example

Say hello to Julia, who works as an engineer at Ascio Innovation firm. She is currently working with Oracle Weblogic server and wants to get a set of metrics for monitoring it. She goes ahead and installs Elastic Oracle Weblogic Integration, which uses Jolokia in the backend to fetch the metrics.

Now, her team wants to advance in the monitoring and has the following requirements:

  1. We should be able to extract metrics beyond the default ones, i.e., metrics that are not supported by the default Oracle Weblogic integration.

  2. We want to have our own bespoke pipelines, visualizations, and experience.

  3. We should be able to identify the metrics coming in from two different instances of Weblogic Servers by having data mapped to separate indices.

All the above requirements can be met by using the Jolokia input package to get a customized experience. Let's see how.

Julia can add the configuration of the Jolokia input package as shown below, fulfilling the first requirement. She provides the hostname, the JMX mappings for the fields she wants to fetch from the JVM application, and the data set name to which the response fields will be mapped.
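As a rough illustration of what such JMX mappings can look like, the snippet below mirrors the Metricbeat Jolokia module’s mapping format; the exact field layout in the input package UI may differ, and the MBeans shown are generic JVM ones rather than WebLogic-specific beans.

jmx.mappings:
  - mbean: 'java.lang:type=Memory'
    attributes:
      - attr: HeapMemoryUsage
        field: memory.heap_usage
  - mbean: 'java.lang:type=Threading'
    attributes:
      - attr: ThreadCount
        field: threads.count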

Configuration Parameters for Jolokia Input package

Metrics getting mapped to the index created by the ‘jolokia_first_dataset’

Julia can customize her data by writing her own ingest pipelines and providing her customized mappings. Also, she can then build her own bespoke dashboards, hence meeting her second requirement.

Customization of Ingest Pipelines and Mappings

Let’s say now Julia wants to use another instance of Oracle Weblogic and get a different set of metrics.

This can be achieved by adding another instance of Jolokia input package and specifying a new data set name as shown in the screenshot below. The resultant metrics will be mapped to a different index/data set hence fulfilling her third requirement. This will help Julia to differentiate metrics coming in from two different instances of Oracle Weblogic.

jolokia metrics

The resultant metrics of the query will be indexed to the new data set, jolokia_second_dataset in the below example.

dataset

As we can see above, the Jolokia input package provides the flexibility to get new metrics by specifying different JMX Mappings, which are not supported in the default Oracle Weblogic integration (the user gets metrics from a predetermined set of JMX Mappings).

The Jolokia input package can also be used to monitor any Java-based application that exposes its metrics through JMX. So a single input package can be used to collect metrics from multiple Java applications and services.

Elastic input packages

Elastic has started supporting input packages from the 8.8.0 release. Some of the input packages are now available in beta and will mature gradually:

  1. SQL input package: The SQL input package allows you to execute queries against any SQL database and store the results in Elasticsearch<sup>®</sup>.

  2. Prometheus input package: This input package can collect metrics from Prometheus exporters (collectors). It can be used by any service exporting its metrics to a Prometheus endpoint.

  3. Jolokia input package: This input package collects metrics from Jolokia agents running on a target JMX server or dedicated proxy server. It can be used for monitoring any Java-based application, which pushes its metrics through JMX.

  4. Statsd input package: The statsd input package spawns a UDP server and listens for metrics in StatsD compatible format. This input can be used to collect metrics from services that send data over the StatsD protocol.

  5. GCP Metrics input package: The GCP Metrics input package can collect custom metrics for any GCP service.

Try it out!

Now that you know more about input packages, try building your own customized integration for your service through input packages, and get started with an Elastic Cloud free trial.

We would love to hear from you about your experience with input packages on the Elastic Discuss forum or in the Elastic Integrations repository.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

]]>
<![CDATA[The DNA of DATA Increasing Efficiency with the Elastic Common Schema]]> https://www.elastic.co/observability-labs/blog/dna-of-data dna-of-data Wed, 25 Sep 2024 00:00:00 GMT The Elastic Common Schema is a fantastic way to simplify and unify a search experience. By aligning disparate data sources into a common language, users have a lower bar to overcome with interpreting events of interest, resolving incidents or hunting for unknown threats. However, there are underlying infrastructure reasons to justify adopting the Elastic Common Schema.

In this blog you will learn about the quantifiable operational benefits of ECS, how to leverage ECS with any data ingest tool, and the pitfalls to avoid. The data source leveraged in this blog is a 3.3GB Nginx log file obtained from Kaggle. The representation of this dataset is divided into three categories: raw, self, and ECS; with raw having zero normalization, self being a demonstration of commonly implemented mistakes observed from my 5+ years of experience working with various users, and finally ECS with the optimal approach of data hygiene.

This hygiene is achieved through the parsing, enrichment, and mapping of the ingested data, akin to the sequencing of DNA in order to express genetic traits. By understanding the data's structure and assigning the correct mapping, a more thorough expression may be represented, stored, and searched upon.

If you would like to learn more about ECS, the dataset used in this blog, or available Elastic integrations, please be sure to check out these related links:

Dataset Validation

Before we begin, let us review how many documents exist and what we're required to ingest. We have 10,365,152 documents/events from our Nginx log file:

nginx access logs

With 10,365,152 documents in our targeted end-state:

end state

Dataset Ingestion: Raw & Self

To achieve the raw and self ingestion techniques, this example leverages Logstash for simplicity. For the raw data ingest, we use a simple file input with no additional modifications or index templates.


    input {
      file {
        id => "NGINX_FILE_INPUT"
        path => "/etc/logstash/raw/access.log"
        ecs_compatibility => disabled
        start_position => "beginning"
        mode => read
      }
    }
    filter {
    }
    output {
      elasticsearch {
        hosts => ["https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243"]
        index => "nginx-raw"
        ilm_enabled => true
        manage_template => false
        user => "username"
        password => "password"
        ssl_verification_mode => none
        ecs_compatibility => disabled
        id => "NGINX-FILE_ES_Output"
      }
    }

For the self ingest, a custom Logstash pipeline with a simple Grok filter was created with no index template applied:

    input {
      file {
        id => "NGINX_FILE_INPUT"
        path => "/etc/logstash/self/access.log"
        ecs_compatibility => disabled
        start_position => "beginning"
        mode => read
      }
    }
    filter {
      grok {
        match => { "message" => "%{IP:clientip} - (?:%{NOTSPACE:requestClient}|-) \[%{HTTPDATE:timestamp}\] \"(?:%{WORD:requestMethod} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})\" (?:-|%{NUMBER:response}) (?:-|%{NUMBER:bytes_in}) (-|%{QS:bytes_out}) %{QS:user_agent}" }
      }
    }
    output {
      elasticsearch {
        hosts => ["https://myscluster.es.us-east4.gcp.elastic-cloud.com:9243"]
        index => "nginx-self"
        ilm_enabled => true
        manage_template => false
        user => "username"
        password => "password"
        ssl_verification_mode => none
        ecs_compatibility => disabled
        id => "NGINX-FILE_ES_Output"
      }
    }

Dataset Ingestion: ECS

Elastic comes with many available integrations that contain everything you need to ensure your data is ingested as efficiently as possible.

integrations

For our use case of Nginx, we'll be using the associated integration's assets only.

nginx integration

The assets that are installed are more than just dashboards: there are ingest pipelines that not only normalize but also enrich the data, while component templates simultaneously map the fields to their correct types. All we have to do is make sure that, as the data comes in, it traverses the ingest pipeline and uses these supplied mappings.

Create your index template, and select the supplied component templates provided from your integration.

nginx-ecs

Think of the component templates like building blocks to an index template. These allow for the reuse of core settings, ensuring standardization is adopted across your data.

nginx-ecs-template
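The same index template can also be created from Dev Tools. A hedged sketch is below; the component template names installed by the integration vary with its version, so check Stack Management > Index Management > Component Templates for the exact names (recent versions expose them as logs-nginx.access@package and logs-nginx.access@custom).

PUT _index_template/nginx-ecs
{
  "index_patterns": ["nginx-ecs*"],
  "priority": 300,
  "composed_of": [
    "logs-nginx.access@package",
    "logs-nginx.access@custom"
  ]
}

The integration-supplied component template typically carries the default ingest pipeline setting, which is what routes incoming documents through the Nginx parsing and enrichment described above.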

For our ingestion method, we merely point to the index name that we specified during the index template creation, in this case nginx-ecs, and Elastic will handle all the rest!

    input {
      file {
        id => "NGINX_FILE_INPUT"
        path => "/etc/logstash/ecs/access.log"
        #ecs_compatibility => disabled
        start_position => "beginning"
        mode => read
      }
    }
    filter {
    }
    output {
      elasticsearch {
        hosts => ["https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243"]
        index => "nginx-ecs"
        ilm_enabled => true
        manage_template => false
        user => "username"
        password => "password"
        ssl_verification_mode => none
        ecs_compatibility => disabled
        id => "NGINX-FILE_ES_Output"
      }
    }

Data Fidelity Comparison

Let's compare how many fields are available to search across the three indices, as well as the quality of the data. Our raw index has only 15 fields to search upon, with most being duplicates for aggregation purposes.

nginx-raw

mapping-1

However from a Discover perspective, we are limited to 6 fields!

nginx-raw-discover

Our self-parsed index has 37 available fields; however, these too are duplicated and not ideal for efficient searching.

nginx-self

mapping-2

From a Discover perspective, here we have almost 3x as many fields to choose from, yet without the correct mapping, searching this data is less than ideal. A great example of this is attempting to calculate the average of bytes_in on a text field.

nginx-self-discover

Finally, with our ECS index, we have 71 fields available to us! Notice that, courtesy of the ingest pipeline, we have enriched fields of geographic information as well as event categorization fields.

nginx-ecs-pipeline

mapping-3

Now what about Discover? There are 51 fields directly available to us for searching purposes:

nginx-ecs-discover

Using Discover as our basis, our self-parsed index has 283% more fields to search upon, whereas our ECS index has 850% more!

table-1

Storage Utilization Comparison

Surely with all these fields in our ECS index the size would be exponentially larger than the self normalized index, let alone the raw index? The results may surprise you.

total-storage

Accounting for the replica of our 3.3GB data set, we can see that normalizing and correctly mapping the data has a significant impact on the amount of storage required.

table-2

Conclusion

While any enriched dataset requires more storage, Elastic provides easy solutions to maximize the fidelity of the data to be searched while simultaneously ensuring operational storage efficiency; that is the power of the Elastic Common Schema.

Let's review how we were able to maximize search while minimizing storage:

  • Installing integration assets for our dataset that we are going to ingest.
  • Customizing the index template to leverage the included components to ensure mapping and parsing are aligned to the Elastic Common Schema.

Ready to get started? Sign up for Elastic Cloud and try out the features and capabilities I've outlined above to get the most value and visibility out of your data.

]]>
<![CDATA[Accelerate log analytics in Elastic Observability with Automatic Import powered by Search AI]]> https://www.elastic.co/observability-labs/blog/elastic-automatic-import-logs-genai elastic-automatic-import-logs-genai Wed, 04 Sep 2024 00:00:00 GMT Elastic is accelerating the adoption of AI-driven log analytics by automating the ingestion of custom logs, which is increasingly important as the deployment of GenAI-based applications grows. These custom data sources must be ingested, parsed, and indexed effortlessly, enabling broader visibility and more straightforward root cause analysis (RCA) without requiring effort from Site Reliability Engineers (SREs). Achieving visibility across an enterprise IT environment is inherently challenging for SREs due to constant growth and change, such as new applications, added systems, and infrastructure migrations to the cloud. Until now, the onboarding of custom data has been costly and complex for SREs. With automatic import, SREs can concentrate on deploying, optimizing, and improving applications.

Automatic Import uses generative AI to automate the development of custom data integrations, reducing the time required from several days to less than 10 minutes and significantly lowering the learning curve for onboarding data. Powered by the  Elastic Search AI Platform, it provides model-agnostic access to leverage large language models (LLMs) and grounds answers in proprietary data through retrieval augmented generation (RAG). This capability is further enhanced by Elastic's expertise in enabling observability teams to utilize any type of data and the flexibility of its Search AI Lake. Arriving at a crucial time when organizations face an explosion of applications and telemetry data, such as logs, Automatic Import streamlines the initial stages of data migration by simplifying data collection and normalization. It also addresses the challenges of building custom connectors, which can otherwise delay deployments, issue analysis, and impact customer experiences.

Create new integration

Enhancing AI Powered Observability with Automatic Import

Automatic Import builds on Elastic Observability’s AI-driven log analytics innovations, such as anomaly detection, log rate and pattern analysis, and the Elastic AI Assistant, and further automates and simplifies SREs’ workflows. Automatic Import applies generative AI to automate the creation of custom data integrations, allowing SREs to focus on logs and other telemetry data. While Elastic provides over 400 prebuilt data integrations, Automatic Import allows SREs to extend integrations to fit their workflows and expand visibility into production environments.

In conjunction with automatic import, Elastic is introducing Elastic Express Migration, a commercial incentive program designed to overcome migration inertia from existing deployments and contracts, providing a faster adoption path for new customers. 

Automatic Import leverages Elastic Common Schema (ECS) with public LLMs to process and analyze data in ECS format, which is also part of OpenTelemetry. Once the data is in, SREs can leverage Elastic’s RAG-based AI Assistant to solve root cause analysis (RCA) challenges in dynamic, complex environments.

Configuring and using Automatic Import

Automatic Import is available to everyone with an Enterprise license. Here is how it works:

  • The user configures connectivity to an LLM and uploads sample data

  • Automatic Import then extrapolates what to expect from the data source. These log samples are paired with LLM prompts that have been honed by Elastic engineers to reliably produce conformant Elasticsearch ingest pipelines. 

  • Automatic Import then iteratively builds, tests, and tweaks a custom ingest pipeline until it meets Elastic integration requirements.

Create new integration Architecture Automatic Import powered by the Elastic Search AI Platform

Within minutes, a validated custom integration is created that accurately maps raw data into ECS and custom fields, populates contextual information (such as related.* fields), and categorizes events.

Automatic Import currently supports Anthropic models via Elastic’s connector for Amazon Bedrock, with additional LLMs to be introduced soon, and it supports JSON and NDJSON-based log formats.

Automatic Import workflow

SREs constantly have to manage new tools and components that developers add to applications. Neo4j is a database that doesn’t have an integration in Elastic. The following steps walk you through how to create an integration for Neo4j with Automatic Import:

  1. Start by navigating to Integrations -> Create new integration.

Create new integration

  2. Provide a name and description for the new data source.

Set up integration

  3. Next, fill in other details and provide some sample data, anonymized as you see fit.

Set up pipeline

  4. Click “Analyze logs” to submit integration details, sample logs, and expert-written instructions from Elastic to the specified LLM, which builds the integration package using generative AI. Automatic Import then fine-tunes the integration in an automated feedback loop until it is validated to meet Elastic requirements.

Analyze sample logs

  5. Review what Automatic Import presents as recommended mappings to ECS fields and custom fields. You can easily adjust these settings if necessary.

Review Analysis

  6. After finalizing the integration, add it to Elastic Agent or view it in Kibana. It is now available alongside your other integrations and follows the same workflows as prebuilt integrations.

Creation complete

  7. Upon deployment, you can begin analyzing newly ingested data immediately. Start by looking at the new Logs Explorer in Elastic Observability.

Look at logs

Accelerate log analytics with Automatic Import

Automatic Import lowers the time required to build and test custom data integrations from days to minutes, accelerating the switch to AI-driven log analytics. Elastic Observability pairs the unique power of Automatic Import with Elastic’s deep library of prebuilt data integrations, enabling wider visibility and fast data onboarding, along with AI-based features, such as the Elastic AI Assistant to accelerate RCA and reduce operational overhead.

Interested in our Express Migration program to level up to Elastic? Contact Elastic to learn more. 

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use. 

Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.

]]>
<![CDATA[Elastic SQL inputs: A generic solution for database metrics observability]]> https://www.elastic.co/observability-labs/blog/sql-inputs-database-metrics-observability sql-inputs-database-metrics-observability Mon, 11 Sep 2023 00:00:00 GMT Elastic<sup>®</sup> SQL inputs (metricbeat module and input package) allows the user to execute SQL queries against many supported databases in a flexible way and ingest the resulting metrics to Elasticsearch<sup>®</sup>. This blog dives into the functionality of generic SQL and provides various use cases for advanced users to ingest custom metrics to Elastic<sup>®</sup>, for database observability. The blog also introduces the fetch from all database new capability, released in 8.10.

Why “Generic SQL”?

Elastic already has metricbeat and integration packages targeted for specific databases. One example is metricbeat for MySQL — and the corresponding integration package. These beats modules and integrations are customized for a specific database, and the metrics are extracted using pre-defined queries from the specific database. The queries used in these integrations and the corresponding metrics are not available for modification.

In contrast, the generic SQL inputs (metricbeat module or input package) can be used to scrape metrics from any supported database using the user's own SQL queries. The queries are provided by the user depending on the specific metrics to be extracted. This enables a much more powerful mechanism for metrics ingestion, where users can choose a specific driver and provide the relevant SQL queries, and the results get mapped to one or more Elasticsearch documents using a structured mapping process (the table/variable formats explained later).

Generic SQL inputs can be used in conjunction with the existing integration packages, which already extract specific database metrics, to extract additional custom metrics dynamically, making this input very powerful. In this blog, Generic SQL input and Generic SQL are used interchangeably.

Generic SQL database metrics collection

Functionalities details

This section covers some of the features that would help with the metrics extraction. We provide a brief description of the response format configuration. Then we dive into the merge_results functionality, which is used to combine results from multiple SQL queries into a single document.

The next key functionality users may be interested in is to collect metrics from all the custom databases, which is now possible with the fetch_from_all_databases feature.

Now let's dive into the specific functionalities:

Different drivers supported

The generic SQL input can fetch metrics from different databases. The current version can fetch metrics using the following drivers: MySQL, PostgreSQL, Oracle, and Microsoft SQL Server (MSSQL).

Response format

The response format in generic SQL is used to manipulate the data in either table or in variable format. Here’s an overview of the formats and syntax for creating and using the table and variables.

Syntax: response_format: table or variables

Response format table
This mode generates a single event for each row. The table format has no restrictions on the number of columns in the response; it can have any number of columns.

Example:

driver: "mssql"
sql_queries:
 - query: "SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'"
   response_format: table

This query returns a response similar to this:

"sql":{
      "metrics":{
         "counter_name":"User Connections ",
         "cntr_value":7
      },
      "driver":"mssql"
}

The response generated above adds the counter_name as a key in the document.

Response format variables
The variable format supports key:value pairs. It expects the query to fetch exactly two columns.

Example:

driver: "mssql"
sql_queries:
 - query: "SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'"
   response_format: variables

The variable format uses the value of the first column in the query above as the key:

"sql":{
      "metrics":{
         "user connections ":7
      },
      "driver":"mssql"
}

In the above response, you can see the value of counter_name is used to generate the key in variable format.

Response optimization: merge_results

We now support merging multiple query responses into a single event. By enabling merge_results, users can significantly optimize the storage space of the metrics ingested into Elasticsearch. Instead of generating multiple documents, a single, merged document is generated wherever applicable: metrics of a similar kind, produced by multiple queries, are combined into a single event.

Output of Merge results

Syntax: merge_results: true or false

In the example below, you can see how the data is loaded into Elasticsearch when merge_results is disabled.

Example:

In this example, we are using two different queries to fetch metrics from the performance counter.

merge_results: false
driver: "mssql"
sql_queries:
  - query: "SELECT cntr_value As 'user_connections' FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'"
    response_format: table
  - query: "SELECT cntr_value As 'buffer_cache_hit_ratio' FROM sys.dm_os_performance_counters WHERE counter_name = 'Buffer cache hit ratio' AND object_name like '%Buffer Manager%'"
    response_format: table

As you can see, the response for the above example generates a single document for each query.

The resulting document from the first query:

"sql":{
      "metrics":{
         "user_connections":7
      },
      "driver":"mssql"
}

And the resulting document from the second query:

"sql":{
      "metrics":{
         "buffer_cache_hit_ratio":87
      },
      "driver":"mssql"
}

When we enable the merge_results flag, both of the above metrics are combined and the data is loaded into a single document.

You can see the merged document in the below example:

"sql":{
      "metrics":{
         "user connections ":7,
         “buffer_cache_hit_ratio”:87
      },
      "driver":"mssql"
}

However, such a merge is possible only if each of the table queries being merged produces a single row. There is no restriction on merging variable queries.

Introducing a new capability: fetch_from_all_databases

This new functionality automatically fetches metrics from all of the system and user databases of Microsoft SQL Server when the fetch_from_all_databases flag is enabled.

The fetch-from-all-databases feature is available starting with the 8.10 release. Prior to 8.10, users had to provide database names manually to fetch metrics from custom/user databases.

Syntax: fetch_from_all_databases: true or false

Below is a sample query with the fetch-from-all-databases flag disabled:

fetch_from_all_databases: false
driver: "mssql"
sql_queries:
  - query: "SELECT @@servername AS server_name, @@servicename AS instance_name, name As 'database_name', database_id FROM sys.databases WHERE name='master';"

The above query fetches metrics only for the provided database name. Here the input database is master, so the metrics are fetched only for the master database.

Below is a sample query with the fetch-from-all-databases flag enabled:

fetch_from_all_databases: true
driver: "mssql"
sql_queries:
  - query: SELECT @@servername AS server_name, @@servicename AS instance_name, DB_NAME() AS 'database_name', DB_ID() AS database_id;
    response_format: table

The above query fetches metrics from all available databases. This is useful when the user wants to get data from all the databases.

Please note: currently this feature is supported only for Microsoft SQL Server and will be used internally by the Microsoft SQL Server integration to extract metrics from all user databases by default.

Using generic SQL: Metricbeat

The generic SQL metricbeat module provides flexibility to execute queries against different database drivers. The metricbeat input is available as GA for any production usage. Here, you can find more information on configuring the generic SQL for different drivers with various examples.
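A minimal sketch of a Metricbeat configuration for this module is shown below; the MySQL connection string, credentials, and query are placeholders to adapt to your own database.

- module: sql
  metricsets: ["query"]
  period: 1m
  hosts: ["root:secret@tcp(localhost:3306)/"]
  driver: "mysql"
  sql_queries:
    - query: "SHOW GLOBAL STATUS LIKE 'Threads_connected'"
      response_format: variables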

Using generic SQL: Input package

The input package provides a flexible solution for advanced users to customize their ingestion experience in Elastic. Generic SQL is now also available as an SQL input package. The input package is currently available for early users as a beta release. Let's walk through how users can use generic SQL via the input package.

Configurations of generic SQL input package:

The configuration options for the generic SQL input package are as below:

  • Driver: This is the SQL database driver you want to use. In this case, we will take mysql as an example.
  • Hosts: Here the user enters the connection string to connect to the database. It would vary depending on which database/driver is being used. Refer here for examples.
  • SQL Queries: Here the user writes the SQL queries they want to fire and the response_format is specified.
  • Data set: The user specifies a data set name to which the response fields get mapped.
  • Merge results: This is an advanced setting, used to merge queries into a single event.

Configuration parameters for SQL input package

Metrics getting mapped to the index created by the ‘sql_first_dataset’

Metrics extensibility with customized SQL queries

Let's say a user is using the MySQL integration, which provides a fixed set of metrics. Their requirement now extends to retrieving more metrics from the MySQL database by running new, customized SQL queries.

This can be achieved by adding an instance of SQL input package, writing the customized queries and specifying a new data set name as shown in the screenshot below.

This way users can get any metrics by executing corresponding queries. The resultant metrics of the query will be indexed to the new data set, sql_second_dataset.

Customization of Ingest Pipelines and Mappings

When there are multiple queries, users can club them into a single event by enabling the Merge Results toggle.

Customizing user experience

Users can customize their data by writing their own ingest pipelines and providing their customized mappings. Users can also build their own bespoke dashboards.

Customization of Ingest Pipelines and Mappings

As we can see above, the SQL input package provides the flexibility to get new metrics by running new queries that are not supported in the default MySQL integration (where the user gets metrics from a predetermined set of queries).

The SQL input package also supports multiple drivers: mssql, postgresql, and oracle. A single input package can therefore be used to cater to all of these databases.

Note: The fetch_from_all_databases feature is not supported in the SQL input package yet.

Try it out!

Now that you know about the various use cases and features of generic SQL, get started with Elastic Cloud and try using the SQL input package for your SQL database to get a customized experience and metrics. If you are looking for new metrics for some of our existing SQL-based integrations — like Microsoft SQL Server, Oracle, and more — go ahead and give the SQL input package a whirl.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

]]>
<![CDATA[Future-proof your logs with ecs@mappings template]]> https://www.elastic.co/observability-labs/blog/future-proof-your-logs-with-ecs-mappings-template future-proof-your-logs-with-ecs-mappings-template Mon, 23 Sep 2024 00:00:00 GMT As the Elasticsearch ecosystem evolves, so do the tools and methodologies designed to streamline data management. One advancement that will significantly benefit our community is the ecs@mappings component template.

ECS (Elastic Common Schema) is a standardized data model for logs and metrics. It defines a set of common field names and data types that help ensure consistency and compatibility.

ecs@mappings is a component template that offers an Elastic-maintained definition of ECS mappings. Each Elasticsearch release contains an always up-to-date definition of all ECS fields.

Elastic Common Schema and OpenTelemetry

Elastic will preserve our users' investment in Elastic Common Schema by donating ECS to OpenTelemetry. Elastic participates and collaborates with the OTel community to merge ECS and OpenTelemetry's Semantic Conventions over time.

The Evolution of ECS Mappings

Historically, users and integration developers have defined ECS (Elastic Common Schema) mappings manually within individual index templates and packages, each meticulously listing its fields. Although straightforward, this approach proved time-consuming and challenging to maintain.

To tackle this challenge, integration developers moved towards two primary methodologies:

  1. Referencing ECS mappings
  2. Importing ECS mappings directly

These methods were steps in the right direction, but they introduced their own challenges, such as the maintenance cost of keeping the ECS mappings up to date with Elasticsearch changes.

Enter ecs@mappings

The ecs@mappings component template supports all the field definitions in ECS, leveraging naming conventions and a set of dynamic templates.

Elastic started shipping the ecs@mappings component template with Elasticsearch v8.9.0, including it in the built-in logs index template (which matches the logs-*-* pattern).

With Elasticsearch v8.13.0, Elastic now includes ecs@mappings in the index templates of all the Elastic Agent integrations.

This move was a breakthrough because:

  • Centralized and official: With ecs@mappings, we now have an official definition of ECS mappings.
  • Out-of-the-box functionality: ECS mappings are readily available, reducing the need for additional imports or references.
  • Simplified maintenance: The need to manually keep up with ECS changes has diminished since the template from Elasticsearch itself remains up-to-date.

Enhanced Consistency and Reliability

With ecs@mappings, ECS mappings become the single source of truth. This unified approach means fewer discrepancies and higher consistency in data streams across integrations.

How Community Users Benefit

Community users stand to gain manifold from the adoption of ecs@mappings. Here are the key advantages:

  1. Reduced configuration hassles: Whether you are an advanced user or just getting started, the simplified setup means fewer configuration steps and fewer opportunities for errors.
  2. Improved data integrity: Since ecs@mappings ensures that field definitions are accurate and up-to-date, data integrity is maintained effortlessly.
  3. Better performance: With less overhead in maintaining and referencing ECS fields, your Elasticsearch operations run more smoothly.
  4. Enhanced documentation and discoverability: As we standardize ECS mappings, the documentation can be centralized, making it easier for users to discover and understand ECS fields.

Let's explore how the ecs@mappings component template helps users achieve these benefits.

Reduced configuration hassles

Modern Elasticsearch versions come with out-of-the-box full ECS field support (see the “requirements” section later for specific versions).

For example, the Custom AWS Logs integration installed on a supported Elasticsearch cluster already includes the ecs@mappings component template in its index template:

GET _index_template/logs-aws_logs.generic
{
  "index_templates": [
    {
      "name": "logs-aws_logs.generic",
      ...,
        "composed_of": [
          "logs@settings",
          "logs-aws_logs.generic@package",
          "logs-aws_logs.generic@custom",
          "ecs@mappings",
          ".fleet_globals-1",
          ".fleet_agent_id_verification-1"
        ],
    ...

There is no need to import or define any ECS field.

Improved data integrity

The ecs@mappings component template supports all the existing ECS fields. If you use any ECS field in your documents, it will be mapped with the expected type.

To ensure that ecs@mappings is always up to date with the ECS repository, we set up a daily automated test to ensure that the component template supports all fields.

Better Performance

Compact definitions

The ECS field definition is exceptionally compact; at the time of this writing, it is 228 lines long and supports all ECS fields. To learn more, see the ecs@mappings component template source code.

It relies on naming conventions and uses dynamic templates to achieve this compactness.
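
As an illustration of the idea (a simplified sketch, not the actual ecs@mappings source), a single dynamic template can map every string field whose name ends in .ip to the ip type:

PUT _component_template/ecs-style-ip-example
{
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "ecs_ip": {
            "path_match": "*.ip",
            "match_mapping_type": "string",
            "mapping": {
              "type": "ip"
            }
          }
        }
      ]
    }
  }
}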

Lazy mapping

Thanks to dynamic templates, Elasticsearch only adds a field to the mapping when it actually appears in an ingested document. This lazy mapping keeps memory overhead to a minimum, improving cluster performance and making field suggestions more relevant.

Enhanced documentation and discoverability

All Elastic Agent integrations are migrating to the ecs@mappings component template. These integrations no longer need to add and maintain ECS field mappings and can reference the official ECS Field Reference or the ECS source code in the Git repository: https://github.com/elastic/ecs/.

Getting started

Requirements

To leverage the ecs@mappings component template, ensure the following stack version:

  • 8.9.0: if your data stream uses the logs index template or you define your own index template.
  • 8.13.0: if your data stream uses the index template of an Elastic Agent integration.

Example

We will use the Custom AWS Logs integration to show you how ecs@mappings can handle mapping for any out-of-the-box ECS field.

Imagine you want to ingest the following log event using the Custom AWS Logs integration:

{
  "@timestamp": "2024-06-11T13:16:00+02:00", 
  "command_line": "ls -ltr",
  "custom_score": 42
}

Dev Tools

Kibana offers an excellent tool for experimenting with the Elasticsearch API: the Dev Tools console. With Dev Tools, users can run all API requests quickly and without much friction.

To open the Dev Tools:

  • Open Kibana
  • Select Management > Dev Tools > Console

Elasticsearch version < 8.13

On Elasticsearch versions before 8.13, the Custom AWS Logs integration has the following index template:

GET _index_template/logs-aws_logs.generic
{
  "index_templates": [
    {
      "name": "logs-aws_logs.generic",
      "index_template": {
        "index_patterns": [
          "logs-aws_logs.generic-*"
        ],
        "template": {
          "settings": {},
          "mappings": {
            "_meta": {
              "package": {
                "name": "aws_logs"
              },
              "managed_by": "fleet",
              "managed": true
            }
          }
        },
        "composed_of": [
          "logs-aws_logs.generic@package",
          "logs-aws_logs.generic@custom",
          ".fleet_globals-1",
          ".fleet_agent_id_verification-1"
        ],
        "priority": 200,
        "_meta": {
          "package": {
            "name": "aws_logs"
          },
          "managed_by": "fleet",
          "managed": true
        },
        "data_stream": {
          "hidden": false,
          "allow_custom_routing": false
        }
      }
    }
  ]
}

As you can see, it does not include the ecs@mappings component template.

If we try to index the test document:

POST logs-aws_logs.generic-default/_doc
{
  "@timestamp": "2024-06-11T13:16:00+02:00", 
  "command_line": "ls -ltr",
  "custom_score": 42
}

The data stream will have the following mappings:

GET logs-aws_logs.generic-default/_mapping/field/command_line
{
  ".ds-logs-aws_logs.generic-default-2024.06.11-000001": {
    "mappings": {
      "command_line": {
        "full_name": "command_line",
        "mapping": {
          "command_line": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    }
  }
}

GET logs-aws_logs.generic-default/_mapping/field/custom_score
{
  ".ds-logs-aws_logs.generic-default-2024.06.11-000001": {
    "mappings": {
      "custom_score": {
        "full_name": "custom_score",
        "mapping": {
          "custom_score": {
            "type": "long"
          }
        }
      }
    }
  }
}

These mappings do not align with ECS, so users and developers had to maintain them.

Elasticsearch version >= 8.13

On Elasticsearch versions 8.13 and newer, the Custom AWS Logs integration has the following index template:

GET _index_template/logs-aws_logs.generic
{
  "index_templates": [
    {
      "name": "logs-aws_logs.generic",
      "index_template": {
        "index_patterns": [
          "logs-aws_logs.generic-*"
        ],
        "template": {
          "settings": {},
          "mappings": {
            "_meta": {
              "package": {
                "name": "aws_logs"
              },
              "managed_by": "fleet",
              "managed": true
            }
          }
        },
        "composed_of": [
          "logs@settings",
          "logs-aws_logs.generic@package",
          "logs-aws_logs.generic@custom",
          "ecs@mappings",
          ".fleet_globals-1",
          ".fleet_agent_id_verification-1"
        ],
        "priority": 200,
        "_meta": {
          "package": {
            "name": "aws_logs"
          },
          "managed_by": "fleet",
          "managed": true
        },
        "data_stream": {
          "hidden": false,
          "allow_custom_routing": false
        },
        "ignore_missing_component_templates": [
          "logs-aws_logs.generic@custom"
        ]
      }
    }
  ]
}

The index template for logs-aws_logs.generic now includes the ecs@mappings component template.

If we try to index the test document:

POST logs-aws_logs.generic-default/_doc
{
  "@timestamp": "2024-06-11T13:16:00+02:00", 
  "command_line": "ls -ltr",
  "custom_score": 42
}

The data stream will have the following mappings:

GET logs-aws_logs.generic-default/_mapping/field/command_line
{
  ".ds-logs-aws_logs.generic-default-2024.06.11-000001": {
    "mappings": {
      "command_line": {
        "full_name": "command_line",
        "mapping": {
          "command_line": {
            "type": "wildcard",
            "fields": {
              "text": {
                "type": "match_only_text"
              }
            }
          }
        }
      }
    }
  }
}

GET logs-aws_logs.generic-default/_mapping/field/custom_score
{
  ".ds-logs-aws_logs.generic-default-2024.06.11-000001": {
    "mappings": {
      "custom_score": {
        "full_name": "custom_score",
        "mapping": {
          "custom_score": {
            "type": "float"
          }
        }
      }
    }
  }
}

In Elasticsearch 8.13, fields like command_line and custom_score get their definition from ECS out-of-the-box.

These mappings align with ECS, so users and developers do not have to maintain them. The same applies to all the hundreds of field definitions in the Elastic Common Schema. You achieve all of this by including a single, roughly 200-line component template in your data stream's index template.

Caveats

Some aspects of how the ecs@mappings component template deals with data types are worth mentioning.

ECS types are not enforced

The ecs@mappings component template does not contain mappings for ECS fields where dynamic mapping already uses the correct field type. Therefore, if you send a field value with a compatible but wrong type, Elasticsearch will not coerce the value.

For example, if you send the following document with a faas.coldstart field (defined as boolean in ECS):

{
  "faas.coldstart": "true"
}

Elasticsearch will map faas.coldstart as a keyword and not a boolean. Therefore, you need to make sure that the values you ingest to Elasticsearch use the right JSON field types, according to how they’re defined in ECS.
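
For comparison, sending the value as a JSON boolean lets dynamic mapping pick the expected type (a minimal illustrative document):

{
  "faas.coldstart": true
}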

This is the tradeoff for having a compact and efficient ecs@mappings component template. It also allows for better compatibility when dealing with a mix of ECS and custom fields because documents won’t be rejected if the types are not consistent with the ones defined in ECS.

Conclusion

The introduction of ecs@mappings marks a significant improvement in managing ECS mappings within Elasticsearch. By centralizing and streamlining these definitions, we can ensure higher consistency, reduced maintenance, and better overall performance.

Whether you're an integration developer or a community user, moving to ecs@mappings represents a step towards more efficient and reliable Elasticsearch operations. As we continue incorporating feedback and evolving our tools, your journey with Elasticsearch will only get smoother and more rewarding.

Join the Conversation

Do you have questions or feedback about ecs@mappings? Post on our community discussion forum or Slack instance and share your experiences with our helpful community of users. Your input is invaluable in helping us fine-tune these advancements for the entire community.

Happy mapping!

]]>
<![CDATA[How to remove PII from your Elastic data in 3 easy steps]]> https://www.elastic.co/observability-labs/blog/remove-pii-data remove-pii-data Tue, 20 Jun 2023 00:00:00 GMT Personally identifiable information (PII) compliance is an ever-increasing challenge for any organization. Whether you’re in ecommerce, banking, healthcare, or other fields where data is sensitive, PII may inadvertently be captured and stored. Having structured logs enables quick identification, removal, and protection of sensitive data fields easily; but what about unstructured messages? Or perhaps call center transcriptions?

Elasticsearch, with its long experience in machine learning, provides various options: you can bring in custom models, such as large language models (LLMs), or use the models Elastic provides. These models help implement PII redaction.

If you would like to learn more about natural language processing, machine learning, and Elastic, please be sure to check out these related articles:

In this blog, we will show you how to set up PII redaction through the use of Elasticsearch’s ability to load a trained model within machine learning and the flexibility of Elastic’s ingest pipelines.

Specifically, we’ll walk through setting up a named entity recognition (NER) model for person and location identification, as well as deploying the redact processor for custom data identification and removal. All of this will then be combined in an ingest pipeline, where we can use Elastic machine learning and data transformation capabilities to remove sensitive information from your data.

Loading the trained model

Before we begin, we must load our NER model into our Elasticsearch cluster. This may be easily accomplished with Docker and the Elastic Eland client. From a command line, let’s install the Eland client via git:

git clone https://github.com/elastic/eland.git

Navigate into the recently downloaded client:

cd eland/

Now let’s build the client:

docker build -t elastic/eland .

From here, you’re ready to deploy the trained model to an Elastic machine learning node! Be sure to replace your username, password, es-cluster-hostname, and esport.

If you’re using the Elastic Cloud or have signed certificates, simply run this command:

docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://<username>:<password>@<es-cluster-hostname>:<esport>/ --hub-model-id dslim/bert-base-NER --task-type ner --start

If you’re using self-signed certificates, run this command:

docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://<username>:<password>@<es-cluster-hostname>:<esport>/ --insecure --hub-model-id dslim/bert-base-NER --task-type ner --start

From here you’ll witness the Eland client in action downloading the trained model from HuggingFace and automatically deploying it into your cluster!

huggingface code

Synchronize your newly loaded trained model by clicking the blue “Synchronize your jobs and trained models” hyperlink in the Machine Learning Overview UI.

Machine Learning Overview UI

Now click the Synchronize button.

Synchronize button

That’s it! Congratulations, you just loaded your first trained model into Elastic!

Create the redact processor and ingest pipeline

From DevTools, let’s configure the redact processor along with our inference processor to take advantage of Elastic’s trained model we just loaded. This will create an ingest pipeline named “redact” that we can then use to remove sensitive data from any field we wish. In this example, I’ll be focusing on the “message” field. Note: at the time of this writing, the redact processor is experimental and must be created via DevTools.

PUT _ingest/pipeline/redact
{
  "processors": [
    {
      "set": {
        "field": "redacted",
        "value": "{{{message}}}"
      }
    },
    {
      "inference": {
        "model_id": "dslim__bert-base-ner",
        "field_map": {
          "message": "text_field"
        }
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "String msg = ctx['message'];\r\n                for (item in ctx['ml']['inference']['entities']) {\r\n                msg = msg.replace(item['entity'], '<' + item['class_name'] + '>')\r\n                }\r\n                ctx['redacted']=msg"
      }
    },
    {
      "redact": {
        "field": "redacted",
        "patterns": [
          "%{EMAILADDRESS:EMAIL}",
          "%{IP:IP_ADDRESS}",
          "%{CREDIT_CARD:CREDIT_CARD}",
          "%{SSN:SSN}",
          "%{PHONE:PHONE}"
        ],
        "pattern_definitions": {
          "CREDIT_CARD": "\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}",
          "SSN": "\d{3}-\d{2}-\d{4}",
          "PHONE": "\d{3}-\d{3}-\d{4}"
        }
      }
    },
    {
      "remove": {
        "field": [
          "ml"
        ],
        "ignore_missing": true,
        "ignore_failure": true
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "failure",
        "value": "pii_script-redact"
      }
    }
  ]
}

OK, but what does each processor really do? Let’s walk through each processor in detail here:

  1. The SET processor creates the field “redacted,” which is copied over from the message field and used later on in the pipeline.

  2. The INFERENCE processor calls the NER model we loaded to be used on the message field for identifying names, locations, and organizations.

  3. The SCRIPT processor then replaces the detected entities within the redacted field (copied from the message field) with their entity class names.

  4. Our REDACT processor uses Grok patterns to identify any custom set of data we wish to remove from the redacted field (which was copied over from the message field).

  5. The REMOVE processor deletes the extraneous ml.* fields so they are not indexed; note that we’ll add “message” to this processor once we validate data is being redacted properly.

  6. The ON_FAILURE / SET processor captures any errors just in case we have them.

Slice your PII

Now that your ingest pipeline with all the necessary steps has been configured, let’s start testing how well we can remove sensitive data from documents. Navigate over to Stack Management, select Ingest Pipelines and search for “redact”, and then click on the result.

Ingest Pipelines

Click on the Manage button, and then click Edit.

Manage button

Here we are going to test our pipeline by adding some documents. Below is a sample you can copy and paste to make sure everything is working correctly.

test pipeline

{
  "_source": {
    "message": "John Smith lives at 123 Main St. Highland Park, CO. His email address is jsmith123@email.com and his phone number is 412-189-9043.  I found his social security number, it is 942-00-1243. Oh btw, his credit card is 1324-8374-0978-2819 and his gateway IP is 192.168.1.2"
  }
}

Simply press the Run the pipeline button, and you will then see the following output:

pii output code

What’s next?

After you’ve added this ingest pipeline to a data set you’re indexing and validated that it is meeting expectations, you can add the message field to be removed so that no PII data is indexed. Simply update your REMOVE processor to include the message field and simulate again to only see the redacted field.
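
For reference, the updated REMOVE processor from the pipeline above would then look like this:

{
  "remove": {
    "field": [
      "ml",
      "message"
    ],
    "ignore_missing": true,
    "ignore_failure": true
  }
}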

pii output code 2

Conclusion

With this step-by-step approach, you are now ready and able to detect and redact any sensitive data throughout your indices.

Here’s a quick recap of what we covered:

  • Loading a pre-trained named entity recognition model into an Elastic cluster
  • Configuring the Redact processor, along with the inference processor, to use the trained model during data ingestion
  • Testing sample data and modifying the ingest pipeline to safely remove personally identifiable information

Ready to get started? Sign up for Elastic Cloud and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your data.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

In this blog post, we may have used third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.

Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.

]]>
<![CDATA[Gaining new perspectives beyond logging: An introduction to application performance monitoring]]> https://www.elastic.co/observability-labs/blog/introduction-apm-tracing-logging introduction-apm-tracing-logging Tue, 30 May 2023 00:00:00 GMT Prioritize customer experience with APM and tracing

Enterprise software development and operations has become an interesting space. We have some incredibly powerful tools at our disposal, yet as an industry, we have failed to adopt many of the tools that can make our lives easier. One such tool that is currently underutilized is application performance monitoring (APM) and tracing, despite the fact that OpenTelemetry has made it possible to adopt with low friction.

Logging, however, is ubiquitous. Every software application has logs of some kind, and the default workflow for troubleshooting (even today) is to go from exceptions experienced by customers and systems to the logs and start from there to find a solution.

There are various challenges with this, one of the main ones being that logs often do not give enough information to solve the problem. Many services today return ambiguous 500 errors with little or nothing to go on. What if there isn’t an error or log file at all, or the problem is that the system is very slow? Logging alone cannot help solve these problems. This leaves users with half-broken systems and poor user experiences. We’ve all been on the wrong side of this, and it can be incredibly frustrating.

The question I find myself asking is why does the customer experience often come second to errors? If the customer experience is a top priority, then a strategy should be in place to adopt tracing and APM and make this as important as logging. Users should stop going to logs by default and thinking primarily in logs, as many are doing today. This will also come with some required changes to mental models.

What’s the path to get there? That’s exactly what we will explore in this blog post. We will start by talking about supporting organizational changes, and then we’ll outline a recommended journey for moving from just logging to a fully integrated solution with logs, traces, and APM.

Cultivating a new monitoring mindset: How to drive APM and tracing adoption

To get teams to shift their troubleshooting mindset, what organizational changes need to be made?

Initially, businesses should consider strategic priorities and goals that need to be shared broadly among the teams. One thing that can help drive this in a very large organization is to consider an entire product team devoted to Observability or a CoE (Center of Excellence) with its own roadmap and priorities.

This team (either virtual or permanent) should start with the customer in mind and work backward, starting with key questions like: What do I need to collect? What do I need to observe? How do I act? Once team members understand the answers to these questions, they can start to think about the technology decisions needed to drive those outcomes.

From a tracing and APM perspective, the areas of greatest concern are the customer experience, service level objectives, and service level outcomes. From here, organizations can start to implement programs of work to continuously improve and share knowledge across teams. This will help to align teams around a common framework with shared goals.

In the next few sections, we will go through a four step journey to help you maximize your success with APM and tracing. This journey will take you through the following key steps on your journey to successful APM adoption:

  1. Ingest: What choices do you have to make to get tracing activated and start ingesting trace data into your observability tools?
  2. Integrate: How does tracing integrate with logs to enable full end-to-end observability, and what else beyond simple tracing can you utilize to get even better resolution on your data?
  3. Analytics and AIOPs: Improve the customer experience and reduce the noise through machine learning.
  4. Scale and total cost of ownership: Roll out enterprise-wide tracing and adopt strategies to deal with data volume.

1. Ingest

Ingesting data for APM purposes generally involves “instrumenting” the application. In this section, we will explore methods for instrumenting applications, talk a little bit about sampling, and finally wrap up with a note on using common schemas for data representation.

Getting started with instrumentation

What options do we have for ingesting APM and trace data? There are many options, which we will discuss to help guide you, but first let's take a step back. APM has a deep history — in the very first implementations of APM, people were concerned mainly with timing methods, like the example below:

timing methods

Usually you had a configuration file to specify which methods you wanted to time, and the APM implementation would instrument the specified code with method timings.

From here things started to evolve, and one of the first additions to APM was to add in tracing.

For Java, it’s fairly trivial to implement a system to do this by using what's known as a Java agent. You just specify the -javaagent command-line argument, and the agent code gets access to the dynamic compilation routines within Java so it can modify the code before it is compiled into machine code, allowing you to “wrap” specific methods with timing or tracing routines. So, auto-instrumenting Java was one of the first things that the original APM vendors did.

OpenTelemetry has agents like this, and most observability vendors that offer APM solutions have their own proprietary ways of doing this, often with more advanced and differing features from the open source tooling.

Things have moved on since then, and Node.JS and Python are now popular.

As a result, ways of auto instrumenting these language runtimes have appeared, which mostly work by injecting the libraries into the code before starting them up. OpenTelemetry has a way of doing this on Kubernetes with an Operator and sidecar here, which supports Python, Node.JS, Java, and DotNet.

The other alternative is to start adding APM and tracing API calls into your own code, which is not dissimilar to adding logging functionality. You may even wish to create an abstraction in your code to deal with this cross-cutting concern, although this is less of a problem now that there are open standards with which you can implement this.

You can see an example of how to add OpenTelemetry spans and attributes to your code for manual instrumentation below and here.

from flask import Flask
import monitor  # Import the module
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import urllib
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.requests import RequestsInstrumentor


# Service name is required for most backends
resource = Resource(attributes={
    SERVICE_NAME: "your-service-name"
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT'),
        headers="Authorization=Bearer%20"+os.getenv('OTEL_EXPORTER_OTLP_AUTH_HEADER')))

provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
RequestsInstrumentor().instrument()

# Initialize Flask app and instrument it
app = Flask(__name__)

@app.route("/completion")
@tracer.start_as_current_span("do_work")
def completion():
    # Attach a custom attribute to the current span
    span = trace.get_current_span()
    if span:
        span.set_attribute("completion_count", 1)
    # Return a response so Flask does not error on a None return value
    return "ok"

By implementing APM in this way, you could even eliminate the need to do any logging by storing all your required logging information within span attributes, exceptions, and metrics. The downside is that you can only do this with code that you own, so you will not be able to remove all logs this way.

Sampling

Many people don’t realize that APM is an expensive process. It adds CPU and memory overhead to your applications, and although there is a lot of value to be had, there are certainly trade-offs to be made.

Should you sample everything 100% and eat the cost? Or should you think about an intelligent trade-off with fewer samples or even tail-based sampling, which many products commonly support? Here, we will talk about the two most common sampling techniques — head-based sampling and tail-based sampling — to help you decide.

Head-based sampling
In this approach, sampling decisions are made at the beginning of a trace, typically at the entry point of a service or application. A fixed rate of traces is sampled, and this decision propagates through all the services involved in a distributed trace.

With head-based sampling, you can control the rate using a configuration, allowing you to control the percentage of requests that are sampled and reported to the APM server. For instance, a sampling rate of 0.5 means that only 50% of requests are sampled and sent to the server. This is useful for reducing the amount of collected data while still maintaining a representative sample of your application's performance.
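
For example, with the OpenTelemetry SDKs a 50% head-based sampling rate can typically be configured through the standard sampler environment variables (a minimal sketch):

export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.5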

Tail-based sampling
Unlike head-based sampling, tail-based sampling makes sampling decisions after the entire trace has been completed. This allows for more intelligent sampling decisions based on the actual trace data, such as only reporting traces with errors or traces that exceed a certain latency threshold.

We recommend tail-based sampling because it has the highest likelihood of reducing the noise and helping you focus on the most important issues. It also helps keep costs down on the data store side. A downside of tail-based sampling, however, is that it results in more data being generated from APM agents. This could use more CPU and memory on your application.
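
As a sketch, tail-based sampling is often configured in the OpenTelemetry Collector with the tail_sampling processor; the policies below (keep error traces and traces slower than 500 ms) are illustrative values, not a recommendation:

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-only
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 500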

OpenTelemetry Semantic Conventions and Elastic Common Schema

OpenTelemetry prescribes Semantic Conventions, or Semantic Attributes, to establish uniform names for various operations and data types. Adhering to these conventions fosters standardization across codebases, libraries, and platforms, ultimately streamlining the monitoring process.

Creating OpenTelemetry spans for tracing is flexible, allowing implementers to annotate them with operation-specific attributes. These spans represent particular operations within and between systems, often involving widely recognized protocols like HTTP or database calls. To effectively represent and analyze a span in monitoring systems, supplementary information is necessary, contingent upon the protocol and operation type.

Unifying attribution methods across different languages is essential for operators to easily correlate and cross-analyze telemetry from polyglot microservices without needing to grasp language-specific nuances.

Elastic's recent contribution of the Elastic Common Schema to OpenTelemetry enhances Semantic Conventions to encompass logs and security.

Abiding by a shared schema yields considerable benefits, enabling operators to rapidly identify intricate interactions and correlate logs, metrics, and traces, thereby expediting root cause analysis and reducing time spent searching for logs and pinpointing specific time frames.

We advocate for adhering to established schemas such as ECS when defining trace, metrics, and log data in your applications, particularly when developing new code. This practice will conserve time and effort when addressing issues.

2. Integrate

Integrations are very important for APM. How well your solution can integrate with other tools and technologies such as cloud, as well as its ability to integrate logs and metrics into your tracing data, is critical to fully understand the customer experience. In addition, most APM vendors have adjacent solutions for synthetic monitoring and profiling to gain deeper perspectives to supercharge your APM. We will explore these topics in the following section.

APM + logs = superpowers!

Because APM agents can instrument code, they can also instrument code that is being used for logging. This way, you can capture log lines directly within APM. This is normally simple to enable.

With this enabled, you will also get automated injection of useful fields like these:

  • service.name, service.version, service.environment
  • trace.id, transaction.id, error.id

This means log messages will be automatically correlated with transactions as shown below, making it far easier to reduce mean time to resolution (MTTR) and find the needle in the haystack:

latency distribution

If this is available to you, we highly recommend turning it on.
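
As an illustration, a correlated log event might then carry fields like these (all values below are hypothetical):

{
  "@timestamp": "2023-05-30T10:15:00.000Z",
  "message": "Order 1234 failed to process",
  "service.name": "checkout-service",
  "service.version": "1.4.2",
  "service.environment": "production",
  "trace.id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "transaction.id": "00f067aa0ba902b7"
}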

Deploying APM inside Kubernetes

It is common for people to want to deploy APM inside a Kubernetes environment, and tracing is critical for monitoring applications in cloud-native environments. There are three different ways you can tackle this.

1. Auto instrumentation using sidecars
With Kubernetes, it is possible to use an init container and something that will modify Kubernetes manifests on the fly to auto instrument your applications.

The init container is used simply to copy the required library or jar file you need into the main Kubernetes pod's container at startup. Then, you can use Kustomize to add the required command-line arguments to bootstrap your agents.

If you are not familiar with it, Kustomize adds, removes, or modifies Kubernetes manifests on the fly. It is even built into the Kubernetes CLI — simply execute kubectl apply -k.

OpenTelemetry has an operator that does all this for you automatically (without the need for Kustomize) for Java, DotNet, Python, and Node.JS, and many vendors also have their own operator or helm charts that can achieve the same result.
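
As a sketch of the operator approach, you define an Instrumentation resource and then opt workloads in with a pod annotation (the endpoint and names below are placeholders):

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317

# Opt a workload in with a pod annotation, e.g. for Java:
#   instrumentation.opentelemetry.io/inject-java: "true"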

2. Baking APM into containers or code
A second option for deploying APM in Kubernetes — and indeed any containerized environment — is using Docker to bake the APM agents and configuration into a Dockerfile.

Have a look at an example here using the OpenTelemetry Java Agent:

# Use the official OpenJDK image as the base image
FROM openjdk:11-jre-slim

# Set up environment variables
ENV APP_HOME /app
ENV OTEL_VERSION 1.7.0-alpha
ENV OTEL_JAVAAGENT_URL https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v${OTEL_VERSION}/opentelemetry-javaagent-${OTEL_VERSION}-all.jar

# Create the application directory
RUN mkdir $APP_HOME
WORKDIR $APP_HOME

# Download the OpenTelemetry Java agent
ADD ${OTEL_JAVAAGENT_URL} /otel-javaagent.jar

# Add your Java application JAR file
COPY your-java-app.jar $APP_HOME/your-java-app.jar

# Expose the application port (e.g. 8080)
EXPOSE 8080

# Configure the OpenTelemetry Java agent and run the application
CMD java -javaagent:/otel-javaagent.jar \
      -Dotel.resource.attributes=service.name=your-service-name \
      -Dotel.exporter.otlp.endpoint=your-otlp-endpoint:4317 \
      -Dotel.exporter.otlp.insecure=true \
      -jar your-java-app.jar

3. Tracing using a service mesh (Envoy/Istio)
The final option you have here is if you are using a service mesh. A service mesh is a dedicated infrastructure layer for handling service-to-service communication in a microservices architecture. It provides a transparent, scalable, and efficient way to manage and control the communication between services, enabling developers to focus on building application features without worrying about inter-service communication complexities.

The great thing about this is that we can activate tracing within the proxy and therefore get visibility into requests between services. We don’t have to change any code or even run APM agents for this; we simply turn on the OpenTelemetry collector that exists within the proxy — therefore this is likely the lowest overhead solution. Learn more about this option.

Synthetics and universal profiling

Most APM vendors have add-ons to the primary APM use cases. Typically, we see synthetics and continuous profiling being added to APM solutions. APM can integrate with both, and there is good value in bringing these technologies together to gain even more insight into issues.

Synthetics
Synthetic monitoring is a method used to measure the performance, availability, and reliability of web applications, websites, and APIs by simulating user interactions and traffic. It involves creating scripts or automated tests that mimic real user behavior, such as navigating through pages, filling out forms, or clicking buttons, and then running these tests periodically from different locations and devices.

This gives Development and Operations teams the ability to spot problems far earlier than they might otherwise, catching issues before real users do in many cases.

Synthetics can be integrated with APM — an APM agent can be injected into the website when the script runs, so even if you didn’t add end user monitoring to your website initially, it can be injected at run time. This usually happens without any input from the user. From there, a trace ID for each request can be passed down through the various layers of the system, allowing teams to follow the request all the way from the synthetics script to the lowest levels of the application stack, such as the database.

observability rainbow sandals

Universal profiling
“Profiling” is a dynamic method of analyzing the complexity of a program, such as CPU utilization or the frequency and duration of function calls. With profiling, you can locate exactly which parts of your application are consuming the most resources. “Continuous profiling” is a more powerful version of profiling that adds the dimension of time. By understanding your system’s resources over time, you can then locate, debug, and fix issues related to performance.

Universal profiling is a further extension of this, which allows you to capture profile information about all of the code running in your system all the time. Using a technology like eBPF can allow you to see all the function calls in your systems, including into things like the Kubernetes runtime. Doing this gives you the ability to finally see unknown unknowns — things you didn’t know were problems. This is very different from APM, which is really about tracking individual traces and requests and the overall customer experience. Universal profiling is about overcoming those issues you didn’t even know existed and even answering the question “What is my most expensive line of code?”

Universal profiling can be linked into APM, showing you profiles that occurred during a specific customer issue, for example, or by linking profiles directly to traces by looking at the global state that exists at the thread level. These technologies can work wonders when used together.

Typically, profiles are viewed as “flame graphs” shown below. The boxes represent the amount of “on-cpu” time spent executing a particular function.

observability universal profiling

3. Analytics and AIOps

The interesting thing about APM is it opens up a whole new world of analytics versus just logs. All of a sudden, you have access to the information flows from inside applications.

This allows you to easily capture things like the amount of money a specific customer is currently spending on your most critical ecommerce store, or look at failed trades in a brokerage app to see how much revenue those failures are costing. You can then apply machine learning algorithms to project future spend or to detect anomalies in this data, giving you a new window into how your business runs.

In this section, we will look at ways to do this and how to get the most out of this new world, as well as how to apply AIOps practices to this new data. We will also discuss getting SLIs and SLOs setup for APM data.

Getting business data into your traces

There are generally two ways of getting business data into your traces. You can modify code and add in Span attributes, an example of which is available here and shown below. Or you can write an extension or a plugin, which has the benefit of avoiding code changes. OpenTelemetry supports adding extensions in its auto-instrumentation agents. Most other APM vendors usually have something similar.

# Note: `counters` (a dict of running totals) and `calculate_cost` are assumed to be
# defined elsewhere in the module this snippet was taken from.
from functools import wraps
import json

from opentelemetry import trace


def count_completion_requests_and_tokens(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        counters['completion_count'] += 1
        response = func(*args, **kwargs)

        token_count = response.usage.total_tokens
        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens
        cost = calculate_cost(response)
        strResponse = json.dumps(response)

        # Set OpenTelemetry attributes
        span = trace.get_current_span()
        if span:
            span.set_attribute("completion_count", counters['completion_count'])
            span.set_attribute("token_count", token_count)
            span.set_attribute("prompt_tokens", prompt_tokens)
            span.set_attribute("completion_tokens", completion_tokens)
            span.set_attribute("model", response.model)
            span.set_attribute("cost", cost)
            span.set_attribute("response", strResponse)
        return response
    return wrapper

Using business data for fun and profit

Once you have the business data in your traces, you can start to have some fun with it. Take a look at the example below for a financial services fraud team. Here we are tracking transactions — average transaction value for our larger business customers. Crucially, we can see if there are any unusual transactions.

customer count

A lot of this is powered by machine learning, which can classify transactions or do anomaly detection. Once you start capturing the data, it is possible to do a lot of useful things like this, and with a flexible platform, integrating machine learning models into this process becomes a breeze.

fraud 12-h

SLIs and SLOs

Service level indicators (SLIs) and service level objectives (SLOs) serve as critical components for maintaining and enhancing application performance. SLIs, which represent key performance metrics such as latency, error rate, and throughput, help quantify an application's performance, while SLOs establish target performance levels to meet user expectations.

By selecting relevant SLIs and setting achievable SLOs, organizations can better monitor their application's performance using APM tools. Continually evaluating and adjusting SLIs and SLOs in response to changes in application requirements, user expectations, or the competitive landscape ensures that the application remains competitive and delivers an exceptional user experience.

In order to define and track SLIs and SLOs, APM becomes a critical perspective that is needed for understanding the user experience. Once APM is implemented, we recommend that organizations perform the following steps.

  • Define SLOs and SLIs required to track them.
  • Define SLO budgets and how they are calculated. Reflect business’ perspective and set realistic targets.
  • Define SLIs to be measured from a user experience perspective.
  • Define different alerting and paging rules: page only on customer-facing SLO degradations, record symptomatic alerts, and notify on critical symptomatic alerts.

Synthetic monitoring and end user monitoring (EUM) can also help you gather the additional data required to understand latency, throughput, and error rate from the user’s perspective, which is where good, business-focused metrics and data matter most.

4. Scale and total cost of ownership

With increased perspectives, customers often run into scalability and total cost of ownership issues. All this new data can be overwhelming. Luckily, there are various techniques you can use to deal with this. Tracing itself can actually help with volume challenges because you can decompose unstructured logs and combine them with traces, which leads to additional efficiency. You can also use the sampling methods we mentioned previously to deal with scale challenges.

In addition to this, for large enterprise scale, we can use streaming pipelines like Kafka or Pulsar to manage the data volumes. This has an additional benefit that you get for free: if you take down the systems consuming the data or they face outages, it is less likely you will lose data.
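
For example, the contrib distribution of the OpenTelemetry Collector can publish telemetry to Kafka with its kafka exporter; a minimal, illustrative configuration (broker addresses and topic are placeholders) might look like this:

exporters:
  kafka:
    brokers: ["kafka-1:9092", "kafka-2:9092"]
    topic: otlp_spans
    protocol_version: 2.0.0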

With this configuration in place, your “Observability pipeline” architecture would look like this:

opentelemetry collector

This completely decouples your sources of data from your chosen observability solution, which will future proof your observability stack going forward, enable you to reach massive scale, and make you less reliant on specific vendor code for collection of data.

Another thing we recommend is being intelligent about instrumentation. This serves two benefits: you get some CPU cycles back in the instrumented application, and your backend data collection systems have less data to process. If you know, for example, that you have no interest in tracking calls to a specific endpoint, you can exclude those classes and methods from instrumentation.
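
With the OpenTelemetry Java agent, for instance, individual instrumentation modules can be switched off with system properties; treat the module names below as illustrative examples rather than recommendations:

# Disable one instrumentation module you do not need
-Dotel.instrumentation.jdbc.enabled=false

# Or disable everything by default and opt back in selectively
-Dotel.instrumentation.common.default-enabled=false
-Dotel.instrumentation.spring-webmvc.enabled=true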

And finally, data tiering is a transformative approach for managing data storage that can significantly reduce the total cost of ownership (TCO) for businesses. Primarily, it allows organizations to store data across different types of storage mediums based on their accessibility needs and the value of the data. For instance, frequently accessed, high-value data can be stored in expensive, high-speed storage, while less frequently accessed, lower-value data can be stored in cheaper, slower storage.

This approach, often incorporated in cloud storage solutions, enables cost optimization by ensuring that businesses only pay for the storage they need at any given time. Furthermore, it provides the flexibility to scale up or down based on demand, eliminating the need for large capital expenditures on storage infrastructure. This scalability also reduces the need for costly over-provisioning to handle potential future demand.

Conclusion

In today's highly competitive and fast-paced software development landscape, simply relying on logging is no longer sufficient to ensure top-notch customer experiences. By adopting APM and distributed tracing, organizations can gain deeper insights into their systems, proactively detect and resolve issues, and maintain a robust user experience.

In this blog, we have explored the journey of moving from a logging-only approach to a comprehensive observability strategy that integrates logs, traces, and APM. We discussed the importance of cultivating a new monitoring mindset that prioritizes customer experience, and the necessary organizational changes required to drive APM and tracing adoption. We also delved into the various stages of the journey, including data ingestion, integration, analytics, and scaling.

By understanding and implementing these concepts, organizations can optimize their monitoring efforts, reduce MTTR, and keep their customers satisfied. Ultimately, prioritizing customer experience through APM and tracing can lead to a more successful and resilient enterprise in today's challenging environment.

Learn more about APM at Elastic.

]]>
<![CDATA[Kibana: How to create impactful visualizations with magic formulas? (part 1)]]> https://www.elastic.co/observability-labs/blog/kibana-impactful-visualizations-with-magic-formulas-part1 kibana-impactful-visualizations-with-magic-formulas-part1 Mon, 09 Sep 2024 00:00:00 GMT Kibana: How to create impactful visualizations with magic formulas? (part 1)

Introduction

In the previous blog post, Designing Intuitive Kibana Dashboards as a non-designer, we highlighted the importance of creating intuitive dashboards. It demonstrated how simple changes (grouping themes, changing chart types, and more) can make a difference in understanding your data. When delivering courses like Data Analysis with Kibana or Elastic Observability Engineer, we emphasize this blog post and how these changes help bring essential information to the surface. I like a complementary approach to reach this goal: using two colors to separate the highest data values from the common ones.

To illustrate this idea, we will use the Sample flight data dataset. Now, let’s compare two visualizations ranking the top 10 destination countries per total number of flights. Which visualization has a higher impact?

Flights: Top 10 destinations

If you chose the second one, you may be wondering how this was done with the Kibana Lens editor. While preparing for the certification last year, I found a way to achieve this result. The secret is using two different layers and some magic formulas. This post will explain how math in Lens formulas helps create two-color data visualizations.

We will start with the first example that emphasizes only the highest value of the dataset we are focusing on. The second example describes how to highlight other high values (as shown in the illustration above).

[Note: the tips explained in this blog post can be applied from v 7.15]

Only the highest value

To understand how math helps to separate high values from common ones, let’s start with this first example: emphasizing only the highest value.

1.1 flights:

We start with a bar horizontal chart:

1.1 flights: Lens bar horizontal chart

We need to identify the highest value within the scope we are currently examining. For that, we will use one of the overall_* functions: overall_max(), a pipeline function (equivalent to a pipeline aggregation in Query DSL).

In our example, we group the flights by destination country. This means we count the number of flights for each DestCountry (= 1 bucket). overall_max() will select the bucket that has the highest value.

The math trick here is to divide the number of flights per bucket by the maximum value found among all buckets. Only one bucket will return 1: the bucket matching the max value found by overall_max(). All the other buckets will return a value greater than 0 and less than 1. We use floor() to ensure any 0.xxx values are rounded down to 0.

1.1 flights: explaining floor()

Now, we can multiply it by count() and we have our formula for the first layer!

Layer 1: count()*floor(count()/overall_max(count()))

From here, in the Lens editor, we duplicate the layer and adjust the formula of the second layer, which contains the rest of the data. We prepend count() followed by the minus operator to the formula. This is the other trick: in this layer, we just need to ensure the highest value is not represented, which happens only once, when count() = overall_max(), that is, when the division equals 1.

1.1 flights: layer 1 + layer 2

Layer 2: count() - count()*floor(count()/overall_max(count()))

To achieve a nice merge of these two layers, we need to do the following adjustments in both:

  • select bar horizontal stacked

  • Vertical axis: change “Rank by” to Custom and ensure the Rank function is “Count”

Here is the final setup of the two layers:

1.1 flights: 2layers setup

Layer 1: count()*floor(count()/overall_max(count()))

Layer 2: count() - count()*floor(count()/overall_max(count()))

This visualization also works well for time series data where you need to quickly highlight which time period (12h in the example below) had the highest number of flights:
1.1 flights: timeseries example

Above the surface

Building on what we have done earlier, we can extend the approach to get other high values above the surface. Let’s see which formula we used to create the visualization in the introduction:

2.1 Flights: Top 10 destinations

For this visualization, we used a property of the round() function: any ratio of 0.5 or more is rounded up to 1, so only the values that are at least 50% of the highest value are brought into the first layer.

2.1 flights: round() > 50% of max explanation

Let's duplicate our first visualization and swap out the floor() function with round().

Layer 1: count()*round(count()/overall_max(count()))

Layer 2: count() - count()*round(count()/overall_max(count()))

It was an easy fix.
What if we want to extend the first layer further by adding more high values?
For instance, we would like all the values above the average.

To do this, we use overall_average() as the new reference value, instead of overall_max(), to separate the eligible values in Layer 1.

As we are comparing against the average value among all the buckets, the division might return values greater than 1.

2.2 flights: round() explanation

Here, the clamp() function nicely solves this issue. 

According to the formula reference, clamp() "limits the value from a minimum to maximum". Combining clamp() and floor() ensures that there are only two possible output values: either the minimum value ( 0 ) or the maximum value ( 1 ) given as parameters.

2.2 flights: clamp() explanation

Applied to our flights dataset, it highlights the country destinations that have more flights than the average:

2.2 flights: above the overall average

Layer 1: count()*clamp(floor(count()/overall_average(count())),0,1)

Layer 2: count() - count()*clamp(floor(count()/overall_average(count())),0,1)

It also opens up options for using other dynamic references. For instance, we could place all the values greater than 60% of the highest value above the surface ( > 0.6*overall_max(count())). We can tune our formula as follows:


count()*clamp(floor(count()/(0.6*overall_max(count()))),0,1)
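
The same kind of check in plain Python, again with made-up counts, shows how the 60% threshold variant splits the buckets:

import math

counts = [120, 340, 500]                 # made-up flights per destination bucket
reference = 0.6 * max(counts)            # 60% of the highest value (300 here)

for c in counts:
    above = max(0, min(1, math.floor(c / reference)))   # clamp(floor(ratio), 0, 1)
    layer1 = c * above                                   # 340 and 500 rise above the surface
    layer2 = c - layer1                                  # 120 stays below the surface
    print(c, layer1, layer2)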

Conclusion<a id="conclusion"></a>

In the first part, we have seen the main tips allowing us to create a two-color histogram:

  • Two layers: one for the highest value and one for the remaining values

  • Visualization type: bar horizontal/vertical stacked

  • To separate the data, we use a formula where only the highest value returns 1 and every other value returns 0

 

Then in the second part, we have seen how we can extend this principle to bring more high values above the surface. This approach can be summarized as follows:

  • Start with layer 1 focusing on the high value: count()*<formula returning 0 or 1>

  • Duplicate the layer and adjust the formula:
    ( count() - count()*<formula returning 0 or 1>)

Finally, we provide 4 generic formulas that are ready to use to spice up your dashboards:

1. Only the highest
Layer 1: count()*floor(count()/overall_max(count()))
Layer 2: count() - count()*floor(count()/overall_max(count()))
2.1. Above the surface: high values (above 50% of the max value)
Layer 1: count()*round(count()/overall_max(count()))
Layer 2: count() - count()*round(count()/overall_max(count()))
2.2. Above the surface: all values above the overall average
Layer 1: count()*clamp(floor(count()/overall_average(count())),0,1)
Layer 2: count() - count()*clamp(floor(count()/overall_average(count())),0,1)
2.3. Above the surface: all values greater than 60% of the highest
Layer 1: count()*clamp(floor(count()/(0.6*overall_max(count()))),0,1)
Layer 2: count() - count()*clamp(floor(count()/(0.6*overall_max(count()))),0,1)

Try these examples out for yourself by signing up for a free trial of Elastic Cloud or download the self-managed version of the Elastic Stack for free. If you have additional questions about getting started, head on over to the Kibana forum or check out the Kibana documentation guide.
In the next blog post, we will see how the new function ifelse() (introduced in version 8.6) will greatly simplify the creation of visualizations with more advanced formulas.

]]>
<![CDATA[Convert Logstash pipelines to OpenTelemetry Collector Pipelines]]> https://www.elastic.co/observability-labs/blog/logstash-to-otel logstash-to-otel Fri, 25 Oct 2024 00:00:00 GMT Convert Logstash pipelines to OpenTelemetry Collector Pipelines

Introduction

Elastic's observability strategy is increasingly aligned with OpenTelemetry. With the recent launch of the Elastic Distributions of OpenTelemetry, we're expanding our offering to make OpenTelemetry easier to adopt: the Elastic Agent now offers an "otel" mode, enabling it to run a custom distribution of the OpenTelemetry Collector and seamlessly enhancing your observability onboarding and experience with Elastic.

This post is designed to assist users familiar with Logstash in transitioning to OpenTelemetry by demonstrating how to convert some standard Logstash pipelines into corresponding OpenTelemetry Collector configurations.

What is OpenTelemetry Collector and why should I care?

OpenTelemetry is an open-source framework that ensures vendor-agnostic data collection, providing a standardized approach for the collection, processing, and ingestion of observability data. Elastic is fully committed to this principle, aiming to make observability truly vendor-agnostic and eliminating the need for users to re-instrument their observability when switching platforms.

By embracing OpenTelemetry, you have access to these benefits:

  • Unified Observability: By using the OpenTelemetry Collector, you can collect and manage logs, metrics, and traces from a single tool, providing holistic observability into your system's performance and behavior. This simplifies monitoring and debugging in complex, distributed environments like microservices.
  • Flexibility and Scalability: Whether you're running a small service or a large distributed system, the OpenTelemetry Collector can be scaled to handle the amount of data generated, offering the flexibility to deploy as an agent (running alongside applications) or as a gateway (a centralized hub).
  • Open Standards: Since OpenTelemetry is an open-source project under the Cloud Native Computing Foundation (CNCF), it ensures that you're working with widely accepted standards, contributing to the long-term sustainability and compatibility of your observability stack.
  • Simplified Telemetry Pipelines: The ability to build pipelines using receivers, processors, and exporters simplifies telemetry management by centralizing data flows and minimizing the need for multiple agents.

In the next sections, we will explain how OTel Collector and Logstash pipelines are structured and clarify how the building blocks of each map to one another.

OTEL Collector Configuration

An OpenTelemetry Collector Configuration has different sections:

  • Receivers: Collect data from different sources.
  • Processors: Transform the data collected by receivers.
  • Exporters: Send the processed data to one or more destinations.
  • Connectors: Link two pipelines together.
  • Service: Defines which components are active.
    • Pipelines: Combine the defined receivers, processors, exporters, and connectors to process the data.
    • Extensions: Optional components that expand the capabilities of the Collector to accomplish tasks not directly involved with processing telemetry data (e.g., health monitoring).
    • Telemetry: Where you can configure observability for the Collector itself (e.g., logging and monitoring).

We can visualize it schematically as follows:

otel-config-schema

We refer to the official documentation (Configuration | OpenTelemetry) for an in-depth introduction to the components.

Logstash pipeline definition

A Logstash pipeline is composed of three main components:

  • Input Plugins: Allow us to read data from different sources.
  • Filter Plugins: Allow us to transform and filter the data.
  • Output Plugins: Allow us to send the data to one or more destinations.

Logstash also has a special input and a special output that allow pipeline-to-pipeline communication; we can consider this a concept similar to an OpenTelemetry connector.

Logstash pipeline compared to Otel Collector components

We can schematize how Logstash Pipeline and OTEL Collector pipeline components can relate to each other as follows:

logstash-pipeline-to-otel-pipeline

Enough theory! Let us dive into some examples.

Convert a Logstash Pipeline into OpenTelemetry Collector Pipeline

Example 1: Parse and transform log line

Let's consider the below line:

2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404
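
Before looking at the two pipelines, here is a rough sketch in plain Python of the fields we want to pull out of that line. It is for illustration only; the actual pipelines below do the same extraction with a grok pattern and an OTTL transform, respectively:

import re

line = "2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404"

pattern = (
    r"^(?P<date>\S+): user (?P<user_name>\w+) accessed from "
    r"(?P<client_ip>[\d.]+):(?P<client_port>\d+) path (?P<url_path>\S+) "
    r"with error (?P<http_status_code>\d+)$"
)

print(re.match(pattern, line).groupdict())
# {'date': '2024-09-20T08:33:27', 'user_name': 'frank', 'client_ip': '89.66.167.22',
#  'client_port': '10592', 'url_path': '/blog', 'http_status_code': '404'}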

We will apply the following steps:

  1. Read the line from the file /tmp/demo-line.log.
  2. Define the output to be an Elasticsearch datastream logs-access-default.
  3. Extract the @timestamp, user.name, client.ip, client.port, url.path and http.status.code.
  4. Drop log messages related to the SYSTEM user.
  5. Parse the date timestamp with the relevant date format and store it in @timestamp.
  6. Add a code http.status.code_description based on known codes' descriptions.
  7. Send data to Elasticsearch.

Logstash pipeline

input {
    file {
        path => "/tmp/demo-line.log" #[1]
        start_position => "beginning"
        add_field => { #[2]
            "[data_stream][type]" => "logs"
            "[data_stream][dataset]" => "access_log"
            "[data_stream][namespace]" => "default"
        }
    }
}

filter {
    grok { #[3]
        match => {
            "message" => "%{TIMESTAMP_ISO8601:[date]}: user %{WORD:[user][name]} accessed from %{IP:[client][ip]}:%{NUMBER:[client][port]:int} path %{URIPATH:[url][path]} with error %{NUMBER:[http][status][code]}"
        }
    }
    if "_grokparsefailure" not in [tags] {
        if [user][name] == "SYSTEM" { #[4]
            drop {}
        }
        date { #[5]
            match => ["[date]", "ISO8601"]
            target => "[@timestamp]"
            timezone => "UTC"
            remove_field => [ "date" ]
        }
        translate { #[6]
            source => "[http][status][code]"
            target => "[http][status][code_description]"
            dictionary => {
                "200" => "OK"
                "403" => "Permission denied"
                "404" => "Not Found"
                "500" => "Server Error"
            }
            fallback => "Unknown error"
        }
    }
}

output {
    elasticsearch { #[7]
        hosts => "elasticsearch-enpoint:443"
        api_key => "${ES_API_KEY}"
    }
}

OpenTelemetry Collector configuration

receivers:
  filelog: #[1]
    start_at: beginning
    include:
      - /tmp/demo-line.log
    include_file_name: false
    include_file_path: true
    storage: file_storage 
    operators:
    # Copy the raw message into event.original (this is done OOTB by Logstash in ECS mode)
    - type: copy
      from: body
      to: attributes['event.original']
    - type: add #[2]
      field: attributes["data_stream.type"]
      value: "logs"
    - type: add #[2]
      field: attributes["data_stream.dataset"]
      value: "access_log_otel" 
    - type: add #[2]
      field: attributes["data_stream.namespace"]
      value: "default"

extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

processors:
  # Adding  host.name (this is done OOTB by Logstash)
  resourcedetection/system:
    detectors: ["system"]
    system:
      hostname_sources: ["os"]
      resource_attributes:
        os.type:
          enabled: false

  transform/grok: #[3]
    log_statements:
      - context: log
        statements:
        - 'merge_maps(attributes, ExtractGrokPatterns(attributes["event.original"], "%{TIMESTAMP_ISO8601:date}: user %{WORD:user.name} accessed from %{IP:client.ip}:%{NUMBER:client.port:int} path %{URIPATH:url.path} with error %{NUMBER:http.status.code}", true), "insert")'

  filter/exclude_system_user:  #[4]
    error_mode: ignore
    logs:
      log_record:
        - attributes["user.name"] == "SYSTEM"

  transform/parse_date: #[5]
    log_statements:
      - context: log
        statements:
          - set(time, Time(attributes["date"], "%Y-%m-%dT%H:%M:%S"))
          - delete_key(attributes, "date")
        conditions:
          - attributes["date"] != nil

  transform/translate_status_code:  #[6]
    log_statements:
      - context: log
        conditions:
        - attributes["http.status.code"] != nil
        statements:
        - set(attributes["http.status.code_description"], "OK")                where attributes["http.status.code"] == "200"
        - set(attributes["http.status.code_description"], "Permission Denied") where attributes["http.status.code"] == "403"
        - set(attributes["http.status.code_description"], "Not Found")         where attributes["http.status.code"] == "404"
        - set(attributes["http.status.code_description"], "Server Error")      where attributes["http.status.code"] == "500"
        - set(attributes["http.status.code_description"], "Unknown Error")     where attributes["http.status.code_description"] == nil

exporters:
  elasticsearch: #[7]
    endpoints: ["elasticsearch-enpoint:443"]
    api_key: ${env:ES_API_KEY}
    tls:
    logs_dynamic_index:
      enabled: true
    mapping:
      mode: ecs

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers:
        - filelog
      processors:
        - resourcedetection/system
        - transform/grok
        - filter/exclude_system_user
        - transform/parse_date
        - transform/translate_status_code
      exporters:
        - elasticsearch

This will generate the following document in Elasticsearch:

{
    "@timestamp": "2024-09-20T08:33:27.000Z",
    "client": {
        "ip": "89.66.167.22",
        "port": 10592
    },
    "data_stream": {
        "dataset": "access_log",
        "namespace": "default",
        "type": "logs"
    },
    "event": {
        "original": "2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404"
    },
    "host": {
        "hostname": "my-laptop",
        "name": "my-laptop",
     },
    "http": {
        "status": {
            "code": "404",
            "code_description": "Not Found"
        }
    },
    "log": {
        "file": {
            "path": "/tmp/demo-line.log"
        }
    },
    "message": "2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404",
    "url": {
        "path": "/blog"
    },
    "user": {
        "name": "frank"
    }
}

Example 2: Parse and transform an NDJSON-formatted log file

Let's consider the below JSON line:

{"log_level":"INFO","message":"User login successful","service":"auth-service","timestamp":"2024-10-11 12:34:56.123 +0100","user":{"id":"A1230","name":"john_doe"}}

We will apply the following steps:

  1. Read a line from the file /tmp/demo.ndjson.
  2. Define the output to be an Elasticsearch datastream logs-json-default
  3. Parse the JSON and assign relevant keys and values.
  4. Parse the date.
  5. Override the message field.
  6. Rename fields to follow ECS conventions.
  7. Send data to Elasticsearch.

Logstash pipeline

input {
    file {
        path => "/tmp/demo.ndjson" #[1]
        start_position => "beginning"
        add_field => { #[2]
            "[data_stream][type]" => "logs"
            "[data_stream][dataset]" => "json"
            "[data_stream][namespace]" => "default"
        }
    }
}

filter {
  if [message] =~ /^\{.*/ {
    json { #[3] & #[5]
        source => "message"
    }
  }
  date { #[4]
    match => ["[timestamp]", "yyyy-MM-dd HH:mm:ss.SSS Z"]
    remove_field => "[timestamp]"
  }
  mutate {
    rename => { #[6]
      "service" => "[service][name]"
      "log_level" => "[log][level]"
    }
  }
}


output {
    elasticsearch { # [7]
        hosts => "elasticsearch-enpoint:443"
        api_key => "${ES_API_KEY}"
    }
}

OpenTelemetry Collector configuration

receivers:
  filelog/json: # [1]
    include: 
      - /tmp/demo.ndjson
    retry_on_failure:
      enabled: true
    start_at: beginning
    storage: file_storage 
    operators:
     # Copy the raw message into event.original (this is done OOTB by Logstash in ECS mode)
    - type: copy
      from: body
      to: attributes['event.original']
    - type: add #[2]
      field: attributes["data_stream.type"]
      value: "logs"      
    - type: add #[2]
      field: attributes["data_stream.dataset"]
      value: "otel" #[2]
    - type: add
      field: attributes["data_stream.namespace"]
      value: "default"     


extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

processors:
  # Adding  host.name (this is done OOTB by Logstash)
  resourcedetection/system:
    detectors: ["system"]
    system:
      hostname_sources: ["os"]
      resource_attributes:
        os.type:
          enabled: false

  transform/json_parse:  #[3]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - merge_maps(attributes, ParseJSON(body), "upsert")
        conditions: 
          - IsMatch(body, "^\\{")
      

  transform/parse_date:  #[4]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(time, Time(attributes["timestamp"], "%Y-%m-%d %H:%M:%S.%L %z"))
          - delete_key(attributes, "timestamp")
        conditions: 
          - attributes["timestamp"] != nil

  transform/override_message_field: #[5]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(body, attributes["message"])
          - delete_key(attributes, "message")

  transform/set_log_severity: # [6]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(severity_text, attributes["log_level"])          

  attributes/rename_attributes: #[6]
    actions:
      - key: service.name
        from_attribute: service
        action: insert
      - key: service
        action: delete
      - key: log_level
        action: delete

exporters:
  elasticsearch: #[7]
    endpoints: ["elasticsearch-enpoint:443"]
    api_key: ${env:ES_API_KEY}
    tls:
    logs_dynamic_index:
      enabled: true
    mapping:
      mode: ecs

service:
  extensions: [file_storage]
  pipelines:
    logs/json:
      receivers: 
        - filelog/json
      processors:
        - resourcedetection/system    
        - transform/json_parse
        - transform/parse_date        
        - transform/override_message_field
        - transform/set_log_severity
        - attributes/rename_attributes
      exporters: 
        - elasticsearch

This will generate the following document in Elasticsearch:

{
    "@timestamp": "2024-10-11T12:34:56.123000000Z",
    "data_stream": {
        "dataset": "otel",
        "namespace": "default",
        "type": "logs"
    },
    "event": {
        "original": "{\"log_level\":\"WARNING\",\"message\":\"User login successful\",\"service\":\"auth-service\",\"timestamp\":\"2024-10-11 12:34:56.123 +0100\",\"user\":{\"id\":\"A1230\",\"name\":\"john_doe\"}}"
    },
    "host": {
        "hostname": "my-laptop",
        "name": "my-laptop",
     },
    "log": {
        "file": {
            "name": "json.log"
        },
        "level": "WARNING"
    },
    "message": "User login successful",
    "service": {
        "name": "auth-service"
    },
    "user": {
        "id": "A1230",
        "name": "john_doe"
    }
}

Conclusion

In this post, we showed examples of how to convert a typical Logstash pipeline into an OpenTelemetry Collector pipeline for logs. While OpenTelemetry provides powerful tools for collecting and exporting logs, if your pipeline relies on complex transformations or scripting, Logstash remains a superior choice. This is because Logstash offers a broader range of built-in features and a more flexible approach to handling advanced data manipulation tasks.

What's Next?

Now that you've seen basic (but realistic) examples of converting a Logstash pipeline to OpenTelemetry, it's your turn to dive deeper. Depending on your needs, you can explore further and find more detailed resources in the following repositories:

If you encounter specific challenges or need to handle more advanced use cases, these repositories will be an excellent resource for discovering additional components or integrations that can enhance your pipeline. All these repositories have a similar structure with folders named receiver, processor, exporter, connector, which should be familiar after reading this blog. Whether you are migrating a simple Logstash pipeline or tackling more complex data transformations, these tools and communities will provide the support you need for a successful OpenTelemetry implementation.

]]>
<![CDATA[Migrating 1 billion log lines from OpenSearch to Elasticsearch]]> https://www.elastic.co/observability-labs/blog/migrating-billion-log-lines-opensearch-elasticsearch migrating-billion-log-lines-opensearch-elasticsearch Wed, 11 Oct 2023 00:00:00 GMT What are the current options to migrate from OpenSearch to Elasticsearch<sup>®</sup>?

OpenSearch is a fork of Elasticsearch 7.10 that has diverged considerably from Elasticsearch lately, resulting in a different set of features and also different performance, as this benchmark shows (hint: it’s currently much slower than Elasticsearch).

Given the differences between the two solutions, restoring a snapshot from OpenSearch is not possible, nor is reindex-from-remote, so our only option is then using something in between that will read from OpenSearch and write to Elasticsearch.

This blog will show you how easy it is to migrate from OpenSearch to Elasticsearch for better performance and less disk usage!

1 - arrows

1 billion log lines

We are going to use part of the data set we used for the benchmark, which takes about half a terabyte on disk, including replicas, and spans one week (January 1–7, 2023).

We have in total 1,009,165,775 documents that take 453.5GB of space in OpenSearch, including the replicas. That’s 241.2 bytes per document. This is going to be important later when we enable a couple of optimizations in Elasticsearch that will bring this total size way down without sacrificing performance!
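
As a quick back-of-the-envelope check of that per-document figure (assuming the reported sizes are binary gigabytes and that the single replica doubles the footprint):

docs = 1_009_165_775
primary_store_bytes = (453.5 / 2) * 1024**3      # primary shards only
print(round(primary_store_bytes / docs, 1))      # roughly 241 bytes per document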

This billion log line data set is spread over nine indices that are part of a datastream we are calling logs-myapplication-prod. We have primary shards of about 25GB in size, according to the best practices for optimal shard sizing. A GET _cat/indices shows us the indices we are dealing with:

index                              docs.count pri rep pri.store.size store.size
.ds-logs-myapplication-prod-000049  102519334   1   1         22.1gb     44.2gb
.ds-logs-myapplication-prod-000048  114273539   1   1         26.1gb     52.3gb
.ds-logs-myapplication-prod-000044  111093596   1   1         25.4gb     50.8gb
.ds-logs-myapplication-prod-000043  113821016   1   1         25.7gb     51.5gb
.ds-logs-myapplication-prod-000042  113859174   1   1         24.8gb     49.7gb
.ds-logs-myapplication-prod-000041  112400019   1   1         25.7gb     51.4gb
.ds-logs-myapplication-prod-000040  113362823   1   1         25.9gb     51.9gb
.ds-logs-myapplication-prod-000038  110994116   1   1         25.3gb     50.7gb
.ds-logs-myapplication-prod-000037  116842158   1   1         25.4gb     50.8gb

Both OpenSearch and Elasticsearch clusters have the same configuration: 3 nodes with 64GB RAM and 12 CPU cores. Just like in the benchmark, the clusters are running in Kubernetes.

Moving data from A to B

Typically, moving data from one Elasticsearch cluster to another is as easy as a snapshot and restore if the clusters are compatible versions, or a reindex from remote if you need real-time synchronization and minimized downtime. These methods do not apply when migrating data from OpenSearch to Elasticsearch because the projects have significantly diverged since the 7.10 fork. However, there is one method that will work: scrolling.

Scrolling

Scrolling involves using an external tool, such as Logstash<sup>®</sup>, to read data from the source cluster and write it to the destination cluster. This method provides a high degree of customization, allowing us to transform the data during the migration process if needed. Here are a couple of advantages of using Logstash:

  • Easy parallelization: It’s really easy to write concurrent jobs that can read from different “slices” of the indices, essentially maximizing our throughput.
  • Queuing: Logstash automatically queues documents before sending.
  • Automatic retries: In the event of a failure or an error during data transmission, Logstash will automatically attempt to resend the data; moreover, it will stop querying the source cluster as often, until the connection is re-established, all without manual intervention.

Scrolling allows us to do an initial search and to keep pulling batches of results from Elasticsearch until there are no more results left, similar to how a “cursor” works in relational databases.

A scrolled search takes a snapshot in time by freezing the segments that make up the index at the time the request is made, preventing those segments from merging. As a result, the scroll doesn’t see any changes that are made to the index after the initial search request.
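
For illustration only, here is roughly what those scroll mechanics look like with the Python Elasticsearch client; Logstash’s input plugin does all of this internally, and the endpoint, index name, and credentials below are placeholders:

from elasticsearch import Elasticsearch

es = Elasticsearch("https://source-cluster:9200", api_key="...")  # placeholder source cluster

# The initial search opens a scroll context that is kept alive for 5 minutes
page = es.search(index="logs-myapplication-prod", scroll="5m", size=500,
                 query={"match_all": {}})
scroll_id = page["_scroll_id"]

while page["hits"]["hits"]:
    for doc in page["hits"]["hits"]:
        pass  # each batch would be handed to the writer for the destination cluster
    # Ask for the next page, keeping the context alive for another 5 minutes
    page = es.scroll(scroll_id=scroll_id, scroll="5m")

es.clear_scroll(scroll_id=scroll_id)  # release the search context when done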

Migration strategies

Reading from A and writing to B can be slow without optimization because it involves paginating through the results, transferring each batch over the network to Logstash, which assembles the documents into another batch and then transfers those batches over the network again to Elasticsearch, where the documents are indexed. So when it comes to such large data sets, we must be very efficient and extract every bit of performance where we can.

Let’s start with the facts — what do we know about the data we need to transfer? We have nine indices in the datastream, each with about 100 million documents. Let’s test with just one of the indices and measure the indexing rate to see how long it takes to migrate. The indexing rate can be seen by activating the monitoring functionality in Elastic<sup>®</sup> and then navigating to the index you want to inspect.

Scrolling in the deep
The simplest approach for transferring the log lines would be to scroll over the entire data set in a single pass and check back when it finishes. Here we will introduce our first two variables: PAGE_SIZE and BATCH_SIZE. The former is how many records we bring from the source every time we query it, and the latter is how many documents Logstash assembles together and writes to the destination index.

Deep scrolling

With such a large data set, the scroll slows down as the pagination gets deeper. The indexing rate starts at 6,000 docs/second and steadily drops to 700 docs/second because the pagination gets very deep. Without any optimization, it would take us 19 days (!) to migrate the 1 billion documents. We can do better than that!

Indexing rate for a deep scroll

Slice me nice
We can optimize scrolling by using an approach called sliced scroll, where we split the index into different slices to consume them independently.

Here we will introduce our last two variables: SLICES and WORKERS. The number of slices cannot be too small, as performance decreases drastically over time, and it can’t be too big, as the overhead of maintaining the scroll contexts would counter the benefits of smaller searches.

Sliced scroll
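
Conceptually, each slice is just one extra clause in the scroll search request, and each slice gets its own independent scroll context, so several workers can read the same index at once. A minimal sketch of what differs per worker (illustrative only; Logstash’s slices option generates these for us):

def sliced_scroll_body(slice_id: int, max_slices: int) -> dict:
    # The only difference between workers is the slice clause in the search body
    return {
        "slice": {"id": slice_id, "max": max_slices},
        "query": {"match_all": {}},
    }

# Worker 0 scrolls with sliced_scroll_body(0, 4), worker 1 with
# sliced_scroll_body(1, 4), and so on, each using the loop shown earlier.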

Let’s start by migrating a single index (out of the nine we have) with different parameters to see what combination gives us the highest throughput.

SLICES | PAGE_SIZE | WORKERS | BATCH_SIZE | Average Indexing Rate
3 | 500 | 3 | 500 | 13,319 docs/sec
3 | 1,000 | 3 | 1,000 | 13,048 docs/sec
4 | 250 | 4 | 250 | 10,199 docs/sec
4 | 500 | 4 | 500 | 12,692 docs/sec
4 | 1,000 | 4 | 1,000 | 10,900 docs/sec
5 | 500 | 5 | 500 | 12,647 docs/sec
5 | 1,000 | 5 | 1,000 | 10,334 docs/sec
5 | 2,000 | 5 | 2,000 | 10,405 docs/sec
10 | 250 | 10 | 250 | 14,083 docs/sec
10 | 250 | 4 | 1,000 | 12,014 docs/sec
10 | 500 | 4 | 1,000 | 10,956 docs/sec

It looks like we have a good set of candidates for maximizing the throughput of a single index, between 12K and 14K documents per second. That doesn't mean we have reached our ceiling. Even though search operations are single-threaded and every slice triggers sequential search operations to read data, nothing prevents us from reading several indices in parallel.

By default, the maximum number of open scrolls is 500 — this limit can be updated with the search.max_open_scroll_context cluster setting, but the default value is enough for this particular migration.
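
Should you ever need to raise that limit, it is a single cluster settings call; here is a sketch with the Python client (placeholder endpoint, and not needed for this migration):

from elasticsearch import Elasticsearch

es = Elasticsearch("https://source-cluster:9200", api_key="...")  # placeholder
es.cluster.put_settings(persistent={"search.max_open_scroll_context": 1000})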

5 - indexing rate

Let’s migrate

Preparing our destination indices

We are going to create a datastream called logs-myapplication-reindex to write the data to, but before indexing any data, let’s ensure our index template and index lifecycle management configurations are properly set up. An index template acts as a blueprint for creating new indices, allowing you to define various settings that should be applied consistently across your indices.

Index lifecycle management policy
Index lifecycle management (ILM) is equally vital, as it automates the management of indices throughout their lifecycle. With ILM, you can define policies that determine how long data should be retained, when it should be rolled over into new indices, and when old indices should be deleted or archived. Our policy is really straightforward:

PUT _ilm/policy/logs-myapplication-lifecycle-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "25gb"
          }
        }
      },
      "warm": {
        "min_age": "0d",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}

Index template (and saving 23% in disk space)
Since we are here, we’re going to go ahead and enable Synthetic Source, a clever feature that avoids storing the original JSON document on disk while still reconstructing it when needed from the stored fields.

For our example, enabling Synthetic Source resulted in a remarkable 23.4% improvement in storage efficiency, reducing the size required to store a single document from 241.2 bytes in OpenSearch to just 185 bytes in Elasticsearch.

Our full index template is therefore:

PUT _index_template/logs-myapplication-reindex
{
  "index_patterns": [
    "logs-myapplication-reindex"
  ],
  "priority": 500,
  "data_stream": {},
  "template": {
    "settings": {
      "index": {
        "lifecycle.name": "logs-myapplication-lifecycle-policy",
        "codec": "best_compression",
        "number_of_shards": "1",
        "number_of_replicas": "1",
        "query": {
          "default_field": [
            "message"
          ]
        }
      }
    },
    "mappings": {
      "_source": {
        "mode": "synthetic"
      },
      "_data_stream_timestamp": {
        "enabled": true
      },
      "date_detection": false,
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "agent": {
          "properties": {
            "ephemeral_id": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "id": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "name": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "type": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "version": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "aws": {
          "properties": {
            "cloudwatch": {
              "properties": {
                "ingestion_time": {
                  "type": "keyword",
                  "ignore_above": 1024
                },
                "log_group": {
                  "type": "keyword",
                  "ignore_above": 1024
                },
                "log_stream": {
                  "type": "keyword",
                  "ignore_above": 1024
                }
              }
            }
          }
        },
        "cloud": {
          "properties": {
            "region": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "data_stream": {
          "properties": {
            "dataset": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "namespace": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "type": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "ecs": {
          "properties": {
            "version": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "event": {
          "properties": {
            "dataset": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "id": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "ingested": {
              "type": "date"
            }
          }
        },
        "host": {
          "type": "object"
        },
        "input": {
          "properties": {
            "type": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "log": {
          "properties": {
            "file": {
              "properties": {
                "path": {
                  "type": "keyword",
                  "ignore_above": 1024
                }
              }
            }
          }
        },
        "message": {
          "type": "match_only_text"
        },
        "meta": {
          "properties": {
            "file": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "metrics": {
          "properties": {
            "size": {
              "type": "long"
            },
            "tmin": {
              "type": "long"
            }
          }
        },
        "process": {
          "properties": {
            "name": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "tags": {
          "type": "keyword",
          "ignore_above": 1024
        }
      }
    }
  }
}

Building a custom Logstash image

We are going to use a containerized Logstash for this migration because both clusters are sitting on a Kubernetes infrastructure, so it's easier to just spin up a Pod that will communicate to both clusters.

Since OpenSearch is not an official Logstash input, we must build a custom Logstash image that contains the logstash-input-opensearch plugin. Let’s use the base image from docker.elastic.co/logstash/logstash:8.16.1 and just install the plugin:

FROM docker.elastic.co/logstash/logstash:8.16.1

USER logstash
WORKDIR /usr/share/logstash
RUN bin/logstash-plugin install logstash-input-opensearch

Writing a Logstash pipeline

Now we have our Logstash Docker image, and we need to write a pipeline that will read from OpenSearch and write to Elasticsearch.

The input

input {
    opensearch {
        hosts => ["os-cluster:9200"]
        ssl => true
        ca_file => "/etc/logstash/certificates/opensearch-ca.crt"
        user => "${OPENSEARCH_USERNAME}"
        password => "${OPENSEARCH_PASSWORD}"
        index => "${SOURCE_INDEX_NAME}"
        slices => "${SOURCE_SLICES}"
        size => "${SOURCE_PAGE_SIZE}"
        scroll => "5m"
        docinfo => true
        docinfo_target => "[@metadata][doc]"
    }
}

Let’s break down the most important input parameters. The values are all represented as environment variables here:

  • hosts: Specifies the host and port of the OpenSearch cluster. In this case, it’s connecting to “os-cluster” on port 9200.
  • index: Specifies the index in the OpenSearch cluster from which to retrieve logs. In this case, it’s “logs-myapplication-prod” which is a datastream that contains the actual indices (e.g., .ds-logs-myapplication-prod-000049).
  • size: Specifies the maximum number of logs to retrieve in each request.
  • scroll: Defines how long a search context will be kept open on the OpenSearch server. In this case, it’s set to “5m,” which means each request must be answered and a new “page” requested within five minutes.
  • docinfo and docinfo_target: These settings control whether document metadata should be included in the Logstash output and where it should be stored. In this case, document metadata is being stored in the [@metadata][doc] field — this is important because the document’s _id will be used as the destination id as well.

The ssl and ca_file are highly recommended if you are migrating from clusters that are in a different infrastructure (separate cloud providers). You don’t need to specify a ca_file if your TLS certificates are signed by a public authority, which is likely the case if you are using a SaaS and your endpoint is reachable over the internet. In this case, only ssl => true would suffice. In our case, all our TLS certificates are self-signed, so we must also provide the Certificate Authority (CA) certificate.

The (optional) filter
We could use this to drop or alter the documents to be written to Elasticsearch if we wanted, but we are not going to, as we want to migrate the documents as is. We are only removing extra metadata fields that Logstash includes in all documents, such as "@version" and "host". We are also removing the original "data_stream" as it contains the source data stream name, which might not be the same in the destination.

filter {
    mutate {
        remove_field => ["@version", "host", "data_stream"]
    }
}

The output
The output is really simple — we are going to name our datastream logs-myapplication-reindex and we are using the document id of the original documents in document_id to ensure there are no duplicate documents. In Elasticsearch, datastream names follow the convention <type>-<dataset>-<namespace>, so our logs-myapplication-reindex datastream has “myapplication” as dataset and “reindex” as namespace.

elasticsearch {
    hosts => "${ELASTICSEARCH_HOST}"

    user => "${ELASTICSEARCH_USERNAME}"
    password => "${ELASTICSEARCH_PASSWORD}"

    document_id => "%{[@metadata][doc][_id]}"

    data_stream => "true"
    data_stream_type => "logs"
    data_stream_dataset => "myapplication"
    data_stream_namespace => "prod"
}

Deploying Logstash

We have a few options to deploy Logstash: it can be deployed locally from the command line, as a systemd service, via docker, or on Kubernetes.

Since both of our clusters are deployed in a Kubernetes environment, we are going to deploy Logstash as a Pod referencing the Docker image we created earlier. Let’s put our pipeline inside a ConfigMap along with some configuration files (pipelines.yml and logstash.yml).

In the below configuration, we have SOURCE_INDEX_NAME, SOURCE_SLICES, SOURCE_PAGE_SIZE, LOGSTASH_WORKERS, and LOGSTASH_BATCH_SIZE conveniently exposed as environment variables so you just need to fill them out.

apiVersion: v1
kind: Pod
metadata:
  name: logstash-1
spec:
  containers:
    - name: logstash
      image: ugosan/logstash-opensearch-input:8.10.0
      imagePullPolicy: Always
      env:
        - name: SOURCE_INDEX_NAME
          value: ".ds-logs-benchmark-dev-000037"
        - name: SOURCE_SLICES
          value: "10"
        - name: SOURCE_PAGE_SIZE
          value: "500"
        - name: LOGSTASH_WORKERS
          value: "4"
        - name: LOGSTASH_BATCH_SIZE
          value: "1000"
        - name: OPENSEARCH_USERNAME
          valueFrom:
            secretKeyRef:
              name: os-cluster-admin-password
              key: username
        - name: OPENSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: os-cluster-admin-password
              key: password
        - name: ELASTICSEARCH_USERNAME
          value: "elastic"
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: es-cluster-es-elastic-user
              key: elastic
      resources:
        limits:
          memory: "4Gi"
          cpu: "2500m"
        requests:
          memory: "1Gi"
          cpu: "300m"
      volumeMounts:
        - name: config-volume
          mountPath: /usr/share/logstash/config
        - name: etc
          mountPath: /etc/logstash
          readOnly: true
  volumes:
    - name: config-volume
      projected:
        sources:
          - configMap:
              name: logstash-configmap
              items:
                - key: pipelines.yml
                  path: pipelines.yml
                - key: logstash.yml
                  path: logstash.yml
    - name: etc
      projected:
        sources:
          - configMap:
              name: logstash-configmap
              items:
                - key: pipeline.conf
                  path: pipelines/pipeline.conf
          - secret:
              name: os-cluster-http-cert
              items:
                - key: ca.crt
                  path: certificates/opensearch-ca.crt
          - secret:
              name: es-cluster-es-http-ca-internal
              items:
                - key: tls.crt
                  path: certificates/elasticsearch-ca.crt
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-configmap
data:
  pipelines.yml: |
    - pipeline.id: reindex-os-es
      path.config: "/etc/logstash/pipelines/pipeline.conf"
      pipeline.batch.size: ${LOGSTASH_BATCH_SIZE}
      pipeline.workers: ${LOGSTASH_WORKERS}
  logstash.yml: |
    log.level: info
    pipeline.unsafe_shutdown: true
    pipeline.ordered: false
  pipeline.conf: |
    input {
        opensearch {
          hosts => ["os-cluster:9200"]
          ssl => true
          ca_file => "/etc/logstash/certificates/opensearch-ca.crt"
          user => "${OPENSEARCH_USERNAME}"
          password => "${OPENSEARCH_PASSWORD}"
          index => "${SOURCE_INDEX_NAME}"
          slices => "${SOURCE_SLICES}"
          size => "${SOURCE_PAGE_SIZE}"
          scroll => "5m"
          docinfo => true
          docinfo_target => "[@metadata][doc]"
        }
    }

    filter {
        mutate {
            remove_field => ["@version", "host", "data_stream"]
        }
    }

    output {
        elasticsearch {
            hosts => "https://es-cluster-es-http:9200"
            ssl => true
            ssl_certificate_authorities => ["/etc/logstash/certificates/elasticsearch-ca.crt"]
            ssl_verification_mode => "full"

            user => "${ELASTICSEARCH_USERNAME}"
            password => "${ELASTICSEARCH_PASSWORD}"

            document_id => "%{[@metadata][doc][_id]}"

            data_stream => "true"
            data_stream_type => "logs"
            data_stream_dataset => "myapplication"
            data_stream_namespace => "reindex"
        }
    }

That’s it.

After a couple of hours, we successfully migrated 1 billion documents from OpenSearch to Elasticsearch and saved more than 23% in disk storage! Now that we have the logs in Elasticsearch, how about extracting actual business value from them? Logs contain a lot of valuable information: with AIOps we can automatically categorize those logs, extract business metrics, and detect anomalies on them. Give it a try.

OpenSearch index | docs | size (bytes) | Elasticsearch index | docs | size (bytes) | Diff.
.ds-logs-myapplication-prod-000037 | 116842158 | 27285520870 | logs-myapplication-reindex-000037 | 116842158 | 21998435329 | 21.46%
.ds-logs-myapplication-prod-000038 | 110994116 | 27263291740 | logs-myapplication-reindex-000038 | 110994116 | 21540011082 | 23.45%
.ds-logs-myapplication-prod-000040 | 113362823 | 27872438186 | logs-myapplication-reindex-000040 | 113362823 | 22234641932 | 22.50%
.ds-logs-myapplication-prod-000041 | 112400019 | 27618801653 | logs-myapplication-reindex-000041 | 112400019 | 22059453868 | 22.38%
.ds-logs-myapplication-prod-000042 | 113859174 | 26686723701 | logs-myapplication-reindex-000042 | 113859174 | 21093766108 | 23.41%
.ds-logs-myapplication-prod-000043 | 113821016 | 27657006598 | logs-myapplication-reindex-000043 | 113821016 | 22059454752 | 22.52%
.ds-logs-myapplication-prod-000044 | 111093596 | 27281936915 | logs-myapplication-reindex-000044 | 111093596 | 21559513422 | 23.43%
.ds-logs-myapplication-prod-000048 | 114273539 | 28111420495 | logs-myapplication-reindex-000048 | 114273539 | 22264398939 | 23.21%
.ds-logs-myapplication-prod-000049 | 102519334 | 23731274338 | logs-myapplication-reindex-000049 | 102519334 | 19307250001 | 20.56%

Interested in trying Elasticsearch? Start our 14-day free trial.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

]]>
<![CDATA[Monitor dbt pipelines with Elastic Observability]]> https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability monitor-dbt-pipelines-with-elastic-observability Fri, 26 Jul 2024 00:00:00 GMT In the Data Analytics team within the Observability organization in Elastic, we use dbt (dbt™, data build tool) to execute our SQL data transformation pipelines. dbt is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code. In particular, we use dbt core, the open-source project, where you can develop from the command line and run your dbt project.

Our data transformation pipelines run daily and process the data that feed our internal dashboards, reports, analyses, and Machine Learning (ML) models.

There have been incidents in the past when the pipelines failed, the source tables contained wrong data, or we introduced a change into our SQL code that caused data quality issues, and we only realized it once we saw a weekly report showing an anomalous number of records. That’s why we have built a monitoring system that proactively alerts us about these types of incidents as soon as they happen and helps us, with visualizations and analyses, understand their root cause, saving us several hours or days of manual investigation.

We have leveraged our own Observability Solution to help solve this challenge, monitoring the entire lifecycle of our dbt implementation. This setup enables us to track the behavior of our models and conduct data quality testing on the final tables. We export dbt process logs from run jobs and tests into Elasticsearch and utilize Kibana to create dashboards, set up alerts, and configure Machine Learning jobs to monitor and assess issues.

The following diagram shows our complete architecture. In a follow-up article, we’ll also cover how we observe our Python data processing and ML model processes using OTEL and Elastic - stay tuned.

1 - architecture

Why monitor dbt pipelines with Elastic?

With every invocation, dbt generates and saves one or more JSON files called artifacts containing log data on the invocation results. dbt run and dbt test invocation logs are stored in the file run_results.json, as per the dbt documentation:

This file contains information about a completed invocation of dbt, including timing and status info for each node (model, test, etc) that was executed. In aggregate, many run_results.json can be combined to calculate average model runtime, test failure rates, the number of record changes captured by snapshots, etc.

Monitoring dbt run invocation logs can help solve several issues, including tracking and alerting about table volumes, detecting excessive slot time from resource-intensive models, identifying cost spikes due to slot time or volume, and pinpointing slow execution times that may indicate scheduling issues. This system was crucial when we merged a PR with a change in our code that had an issue, producing a sudden drop in the number of daily rows in upstream Table A. By ingesting the dbt run logs into Elastic, our anomaly detection job quickly identified anomalies in the daily row counts for Table A and its downstream tables, B, C, and D. The Data Analytics team received an alert notification about the issue, allowing us to promptly troubleshoot, fix and backfill the tables before it affected the weekly dashboards and downstream ML models.

Monitoring dbt test invocation logs can also address several issues, such as identifying duplicates in tables, detecting unnoticed alterations in allowed values for specific fields through validation of all enum fields, and resolving various other data processing and quality concerns. With dashboards and alerts on data quality tests, we proactively identify issues like duplicate keys, unexpected category values, and increased nulls, ensuring data integrity. In our team, we had an issue where a change in one of our raw lookup tables produced duplicated rows in our user table, doubling the number of users reported. By ingesting the dbt test logs into Elastic, our rules detected that some duplicate tests had failed. The team received an alert notification about the issue, allowing us to troubleshoot it right away by finding the upstream table that was the root cause. These duplicates meant that downstream tables had to process 2x the amount of data, creating a spike in the bytes processed and slot time. The anomaly detection and alerts on the dbt run logs also helped us spot these spikes for individual tables and allowed us to quantify the impact on our billing.

Processing our dbt logs with Elastic and Kibana allows us to obtain real-time insights, helps us quickly troubleshoot potential issues, and keeps our data transformation processes running smoothly. We set up anomaly detection jobs and alerts in Kibana to monitor the number of rows processed by dbt, the slot time, and the results of the tests. This lets us catch real-time incidents, and by promptly identifying and fixing these issues, Elastic makes our data pipeline more resilient and our models more cost-effective, helping us stay on top of cost spikes or data quality issues.

We can also correlate this information with other events ingested into Elastic. For example, using the Elastic GitHub connector, we can correlate data quality test failures or other anomalies with code changes to find the commit or PR that caused the issue. By ingesting application logs into Elastic, we can also analyze whether these pipeline issues have affected downstream applications, increasing latency or error rates or reducing throughput, using APM. By ingesting billing, revenue data, or web traffic, we could also see the impact on business metrics.

How to export dbt invocation logs to Elasticsearch

We use the Python Elasticsearch client to send the dbt invocation logs to Elastic after we run our dbt run and dbt test processes daily in production. The setup just requires you to install the Elasticsearch Python client and obtain your Elastic Cloud ID (go to https://cloud.elastic.co/deployments/, select your deployment, and find the Cloud ID) and an Elastic Cloud API key (following this guide).

This Python helper function will index the results from your run_results.json file into the specified indices. You just need to export the following variables to the environment:

  • RESULTS_FILE: path to your run_results.json file
  • DBT_RUN_LOGS_INDEX: the name you want to give to the dbt run logs index in Elastic, e.g. dbt_run_logs
  • DBT_TEST_LOGS_INDEX: the name you want to give to the dbt test logs index in Elastic, e.g. dbt_test_logs
  • ES_CLUSTER_CLOUD_ID: your Elastic Cloud ID
  • ES_CLUSTER_API_KEY: your Elastic Cloud API key

Then call the function log_dbt_es from your python code or save this code as a python script and run it after executing your dbt run or dbt test commands:

from elasticsearch import Elasticsearch, helpers
import os
import sys
import json

def log_dbt_es():
   RESULTS_FILE = os.environ["RESULTS_FILE"]
   DBT_RUN_LOGS_INDEX = os.environ["DBT_RUN_LOGS_INDEX"]
   DBT_TEST_LOGS_INDEX = os.environ["DBT_TEST_LOGS_INDEX"]
   es_cluster_cloud_id = os.environ["ES_CLUSTER_CLOUD_ID"]
   es_cluster_api_key = os.environ["ES_CLUSTER_API_KEY"]


   es_client = Elasticsearch(
       cloud_id=es_cluster_cloud_id,
       api_key=es_cluster_api_key,
       request_timeout=120,
   )


   if not os.path.exists(RESULTS_FILE):
       print(f"ERROR: {RESULTS_FILE} No dbt run results found.")
       sys.exit(1)


   with open(RESULTS_FILE, "r") as json_file:
       results = json.load(json_file)
       timestamp = results["metadata"]["generated_at"]
       metadata = results["metadata"]
       elapsed_time = results["elapsed_time"]
       args = results["args"]
       docs = []
       for result in results["results"]:
           if result["unique_id"].split(".")[0] == "test":
               result["_index"] = DBT_TEST_LOGS_INDEX
           else:
               result["_index"] = DBT_RUN_LOGS_INDEX
           result["@timestamp"] = timestamp
           result["metadata"] = metadata
           result["elapsed_time"] = elapsed_time
           result["args"] = args
           docs.append(result)
       _ = helpers.bulk(es_client, docs)
   return "Done"

# Call the function
log_dbt_es()

If you want to add/remove any other fields from run_results.json, you can modify the above function to do it.
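
For example, inside the loop over results["results"] you could keep dbt's invocation_id and drop bulkier keys before each document is appended; the field names below are illustrative, so check the keys present in your own run_results.json:

# Inside the loop, before docs.append(result):
result["invocation_id"] = metadata.get("invocation_id")  # taken from the artifact metadata
result.pop("compiled_code", None)  # drop bulky keys you don't need in Elasticsearch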

Once the results are indexed, you can use Kibana to create Data Views for both indexes and start exploring them in Discover.

Go to Discover, click on the data view selector on the top left and “Create a data view”.

2 - discover create a data view

Now you can create a data view with your preferred name. Do this for both dbt run (DBT_RUN_LOGS_INDEX in your code) and dbt test (DBT_TEST_LOGS_INDEX in your code) indices:

3 - create a data view

Going back to Discover, you’ll be able to select the Data Views and explore the data.

4 - discover logs explorer

dbt run alerts, dashboards and ML jobs

The invocation of dbt run executes compiled SQL model files against the current database. dbt run invocation logs contain the following fields:

  • unique_id: Unique model identifier
  • execution_time: Total time spent executing this model run

The logs also contain the following metrics about the job execution from the adapter:

  • adapter_response.bytes_processed
  • adapter_response.bytes_billed
  • adapter_response.slot_ms
  • adapter_response.rows_affected
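
For reference, a single dbt run entry in run_results.json looks roughly like this, shown here as a Python dict with made-up values and trimmed to the fields listed above:

run_result_entry = {
    "unique_id": "model.my_project.daily_users",   # made-up model name
    "status": "success",
    "execution_time": 12.7,                        # seconds
    "adapter_response": {
        "bytes_processed": 5368709120,
        "bytes_billed": 5368709120,
        "slot_ms": 43210,
        "rows_affected": 1250000,
    },
}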

We have used Kibana to set up Anomaly Detection jobs on the above-mentioned metrics. You can configure a multi-metric job split by unique_id to be alerted when the sum of rows affected, slot time consumed, or bytes billed is anomalous per table. You can track one job per metric. If you have built a dashboard of the metrics per table, you can use this shortcut to create the Anomaly Detection job directly from the visualization. After the jobs are created and are running on incoming data, you can view the jobs and add them to a dashboard using the three dots button in the anomaly timeline:

5 - add ML job to dashboard

We have used the ML job to set up alerts that send us emails/slack messages when anomalies are detected. Alerts can be created directly from the Jobs (Machine Learning > Anomaly Detection Jobs) page, by clicking on the three dots at the end of the ML job row:

6 - create alert from ML job

We also use Kibana dashboards to visualize the anomaly detection job results and related metrics per table, to identify which tables consume most of our resources, to have visibility on their temporal evolution, and to measure aggregated metrics that can help us understand month over month changes.

7 - ML job in dashboard 8 - dashboard slot time chart 9 - dashboard aggregated metrics

dbt test alerts and dashboards

You may already be familiar with tests in dbt, but if you’re not, dbt data tests are assertions you make about your models. Using the command dbt test, dbt will tell you if each test in your project passes or fails. Here is an example of how to set them up. In our team, we use out-of-the-box dbt tests (unique, not_null, accepted_values, and relationships) and the packages dbt_utils and dbt_expectations for some extra tests. When the command dbt test is run, it generates logs that are stored in run_results.json.

dbt test logs contain the following fields:

  • unique_id: Unique test identifier; tests contain the “test” prefix in their unique identifier
  • status: Result of the test (pass or fail)
  • execution_time: Total time spent executing this test
  • failures: Will be 0 if the test passes and 1 if it fails
  • message: If the test fails, the reason why it failed

The logs also contain the metrics about the job execution from the adapter.

We have set up alerts on document count (see guide) that send us an email / Slack message when any test fails. The rule for the alerts is set up on the dbt test Data View that we created before, with a query filtering on status:fail to obtain the logs for the tests that have failed, and with the rule condition being a document count greater than 0. Whenever there is a failure in any test in production, we get an alert with links to the alert details and dashboards so we can troubleshoot it:

10 - alert

We have also built a dashboard to visualize the tests run, tests failed, and their execution time and slot time to have a historical view of the test run:

11 - dashboard dbt tests

Finding Root Causes with the AI Assistant

The most effective way for us to analyze these multiple sources of information is using the AI Assistant to help us troubleshoot incidents. In our case, we got an alert about a test failure, and we used the AI Assistant to give us context on what happened. Then we asked if there were any downstream consequences, and the AI Assistant interpreted the results of the Anomaly Detection job, which indicated a spike in slot time for one of our downstream tables and quantified the increase in slot time vs. the baseline. Then we asked for the root cause, and the AI Assistant was able to find and provide us a link to a PR from our GitHub changelog that matched the start of the incident and was the most probable cause.

12 - ai assistant troubleshoot

Conclusion

As a Data Analytics team, we are responsible for guaranteeing that the tables, charts, models, reports, and dashboards we provide to stakeholders are accurate and contain the right sources of information. As teams grow, the number of models we own becomes larger and more interconnected, and it isn’t easy to guarantee that everything is running smoothly and providing accurate results. Having a monitoring system that proactively alerts us on cost spikes, anomalies in row counts, or data quality test failures is like having a trusted companion that will alert you in advance if something goes wrong and help you get to the root cause of the issue.

dbt invocation logs are a crucial source of information about the status of our data pipelines, and Elastic is the perfect tool to extract the maximum potential out of them. Use this blog post as a starting point for utilizing your dbt logs to help your team achieve greater reliability and peace of mind, allowing them to focus on more strategic tasks rather than worrying about potential data issues.

]]>
<![CDATA[Monitor your Python data pipelines with OTEL]]> https://www.elastic.co/observability-labs/blog/monitor-your-python-data-pipelines-with-otel monitor-your-python-data-pipelines-with-otel Thu, 08 Aug 2024 00:00:00 GMT This article delves into how to implement observability practices, particularly using OpenTelemetry (OTEL) in Python, to enhance the monitoring and quality control of data pipelines using Elastic. While the examples presented in the article focus on ETL (Extract, Transform, Load) processes, where accuracy and reliability are crucial for Business Intelligence (BI), the strategies and tools discussed are equally applicable to Python processes used for Machine Learning (ML) models or other data processing tasks.

Introduction

Data pipelines, particularly ETL processes, form the backbone of modern data architectures. These pipelines are responsible for extracting raw data from various sources, transforming it into meaningful information, and loading it into data warehouses or data lakes for analysis and reporting.

In our organization, we have Python-based ETL scripts that play a pivotal role in exporting and processing data from Elasticsearch (ES) clusters and loading it into Google BigQuery (BQ). This processed data then feeds into DBT (Data Build Tool) models, which further refine the data and make it available for analytics and reporting. To see the full architecture and learn how we monitor our DBT pipelines with Elastic see Monitor your DBT pipelines with Elastic Observability. In this article we focus on the ETL scripts. Given the critical nature of these scripts, it is imperative to set up mechanisms to control and ensure the quality of the data they generate.

The strategies discussed here can be extended to any script or application that handles data processing or machine learning models, regardless of the programming language used as long as there exists a corresponding agent that supports OTEL instrumentation.

Motivation

Observability in data pipelines involves monitoring the entire lifecycle of data processing to ensure that everything works as expected. It includes:

  1. Data Quality Control:
  • Detecting anomalies in the data, such as unexpected drops in record counts.
  • Verifying that data transformations are applied correctly and consistently.
  • Ensuring the integrity and accuracy of the data loaded into the data warehouse.
  2. Performance Monitoring:
  • Tracking the execution time of ETL scripts to identify bottlenecks and optimize performance.
  • Monitoring resource usage, such as memory and CPU consumption, to ensure efficient use of infrastructure.
  3. Real-time Alerting:
  • Setting up alerts for immediate notification of issues such as failed ETL jobs, data quality issues, or performance degradation.
  • Identifying the root cause of such incidents.
  • Proactively addressing incidents to minimize downtime and impact on business operations.

Issues such as failed ETL jobs can even point to larger infrastructure problems or quality issues in the source data.

Steps for Instrumentation

Here are the steps to automatically instrument your Python script for exporting OTEL traces, metrics, and logs.

Step 1: Import Required Libraries

We first need to install the following libraries.

pip install elastic-opentelemetry google-cloud-bigquery[opentelemetry]

You can also add them to your project's requirements.txt file and install them with pip install -r requirements.txt.

Explanation of Dependencies

  1. elastic-opentelemetry: This package is the Elastic Distribution for OpenTelemetry Python. Under the hood it will install the following packages:

    • opentelemetry-distro: This package is a convenience distribution of OpenTelemetry, which includes the OpenTelemetry SDK, APIs, and various instrumentation packages. It simplifies the setup and configuration of OpenTelemetry in your application.

    • opentelemetry-exporter-otlp: This package provides an exporter that sends telemetry data to the OpenTelemetry Collector or any other endpoint that supports the OpenTelemetry Protocol (OTLP). This includes traces, metrics, and logs.

    • opentelemetry-instrumentation-system-metrics: This package provides instrumentation for collecting system metrics, such as CPU usage, memory usage, and other system-level metrics.

  2. google-cloud-bigquery[opentelemetry]: This package integrates Google Cloud BigQuery with OpenTelemetry, allowing you to trace and monitor BigQuery operations.

Step 2: Export OTEL Variables

Set the necessary OpenTelemetry (OTEL) variables by getting the configuration from APM OTEL from Elastic.

Go to APM -> Services -> Add data (top left corner).

1 - Get OTEL variables step 1

In this section you will find the steps to configure various APM agents. Navigate to OpenTelemetry to find the variables that you need to export.

2 - Get OTEL variables step 2

Find OTLP Endpoint:

  • Look for the section related to OpenTelemetry or OTLP configuration.
  • The OTEL_EXPORTER_OTLP_ENDPOINT is typically provided as part of the setup instructions for integrating OpenTelemetry with Elastic APM. It might look something like https://<your-apm-server>/otlp.

Obtain OTLP Headers:

  • In the same section, you should find instructions or a field for OTLP headers. These headers are often used for authentication purposes.
  • Copy the necessary headers provided by the interface. They might look like Authorization: Bearer <your-token>.

Note: You need to replace the whitespace between Bearer and your token with %20 in the OTEL_EXPORTER_OTLP_HEADERS variable when using Python.

Alternatively, you can use a different approach for authentication using API keys (see instructions). If you are using our serverless offering, you will need to use this approach instead.

Set up the variables:

  • Replace the placeholders in your script with the actual values obtained from the Elastic APM interface and execute it in your shell via the source command source env.sh.

Below is a script to set these variables:

#!/bin/bash
echo "--- :otel: Setting OTEL variables"
export OTEL_EXPORTER_OTLP_ENDPOINT='https://your-apm-server/otlp:443'
export OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer%20your-token'
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
export OTEL_PYTHON_LOG_CORRELATION=true
export ELASTIC_OTEL_SYSTEM_METRICS_ENABLED=true
export OTEL_METRIC_EXPORT_INTERVAL=5000
export OTEL_LOGS_EXPORTER="otlp,console"

With these variables set, we are ready for auto-instrumentation without needing to add anything to the code.

Explanation of Variables

  • OTEL_EXPORTER_OTLP_ENDPOINT: This variable specifies the endpoint to which OTLP data (traces, metrics, logs) will be sent. Replace the placeholder with your actual OTLP endpoint.

  • OTEL_EXPORTER_OTLP_HEADERS: This variable specifies any headers required for authentication or other purposes when sending OTLP data. Replace the placeholder with your actual OTLP headers.

  • OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED: This variable enables auto-instrumentation for logging in Python, allowing logs to be automatically enriched with trace context.

  • OTEL_PYTHON_LOG_CORRELATION: This variable enables log correlation, which includes trace context in log entries to correlate logs with traces.

  • OTEL_METRIC_EXPORT_INTERVAL: This variable specifies the metric export interval in milliseconds, in this case 5s.

  • OTEL_LOGS_EXPORTER: This variable specifies the exporter to use for logs. Setting it to "otlp" means that logs will be exported using the OTLP protocol. Adding "console" specifies that logs should be exported to both the OTLP endpoint and the console. In our case, for better visibility on the infra side, we chose to export to the console as well.

  • ELASTIC_OTEL_SYSTEM_METRICS_ENABLED: This variable needs to be set to true when using the Elastic distribution, as it defaults to false.

Note: OTEL_METRICS_EXPORTER and OTEL_TRACES_EXPORTER: These variables specify the exporters to use for metrics and traces. They are set to "otlp" by default, which means that metrics and traces will be exported using the OTLP protocol.

Running Python ETLs

We run Python ETLs with the following command:

OTEL_RESOURCE_ATTRIBUTES="service.name=x-ETL,service.version=1.0,deployment.environment=production" opentelemetry-instrument python3 X_ETL.py

Explanation of the Command

  • OTEL_RESOURCE_ATTRIBUTES: This variable specifies additional resource attributes, such as the service name, service version, and deployment environment, that will be included in all telemetry data. You can customize these values to your needs and use a different service name for each script.

  • opentelemetry-instrument: This command auto-instruments the specified Python script for OpenTelemetry. It sets up the necessary hooks to collect traces, metrics, and logs.

  • python3 X_ETL.py: This runs the specified Python script (X_ETL.py).

Tracing

We export the traces via the default OTLP protocol.

Tracing is a key aspect of monitoring and understanding the performance of applications. Spans form the building blocks of tracing. They encapsulate detailed information about the execution of specific code paths. They record the start and end times of activities and can have hierarchical relationships with other spans, forming a parent/child structure.

Spans include essential attributes such as transaction IDs, parent IDs, start times, durations, names, types, subtypes, and actions. Additionally, spans may contain stack traces, which provide a detailed view of function calls, including attributes like function name, file path, and line number, which is especially useful for debugging. These attributes help us analyze the script's execution flow, identify performance issues, and enhance optimization efforts.

With the default instrumentation, the whole Python script would be a single span. In our case, we have decided to manually add specific spans for the different phases of the Python process, to be able to measure their latency, throughput, error rate, etc. individually. This is how we define spans manually:

from opentelemetry import trace

if __name__ == "__main__":

    tracer = trace.get_tracer("main")

    with tracer.start_as_current_span("initialization") as span:
        # Init code
        ...
    with tracer.start_as_current_span("search") as span:
        # Step 1 - Search code
        ...
    with tracer.start_as_current_span("transform") as span:
        # Step 2 - Transform code
        ...
    with tracer.start_as_current_span("load") as span:
        # Step 3 - Load code
        ...
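Beyond the step-level spans, you can also attach custom attributes or record exceptions on a span so they show up next to the trace in APM. The sketch below uses the standard OpenTelemetry API; the transform_data helper and the rows_extracted attribute name are illustrative, not something set by the auto-instrumentation:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("main")

def transform_data():
    # Placeholder for your real transform step (hypothetical helper)
    return [{"id": 1}, {"id": 2}]

with tracer.start_as_current_span("transform") as span:
    try:
        rows = transform_data()
        # Custom attribute; "rows_extracted" is an illustrative name
        span.set_attribute("rows_extracted", len(rows))
    except Exception as exc:
        # Attach the exception and its stack trace to the span
        span.record_exception(exc)
        span.set_status(Status(StatusCode.ERROR))
        raise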

You can explore traces in the APM interface as shown below.

3 - APM Traces view

Metrics

We also export metrics, such as CPU usage and memory, via the default OTLP protocol. No extra code needs to be added to the script itself.

Note: Remember to set ELASTIC_OTEL_SYSTEM_METRICS_ENABLED to true.

4 - APM Metrics view

Logging

We export logs via the default OTLP protocol as well.

For logging, we modify the logging calls to add extra fields using a dictionary structure (bq_fields) as shown below:

        job.result()  # Waits for table load to complete
        job_details = client.get_job(job.job_id)  # Get job details

        # Extract job information
        bq_fields = {
            # "slot_time_ms": job_details.slot_ms,
            "job_id": job_details.job_id,
            "job_type": job_details.job_type,
            "state": job_details.state,
            "path": job_details.path,
            "job_created": job_details.created.isoformat(),
            "job_ended": job_details.ended.isoformat(),
            "execution_time_ms": (
                job_details.ended - job_details.created
            ).total_seconds()
            * 1000,
            "bytes_processed": job_details.output_bytes,
            "rows_affected": job_details.output_rows,
            "destination_table": job_details.destination.table_id,
            "event": "BigQuery Load Job", # Custom event type
            "status": "success", # Status of the step (success/error)
            "category": category # ETL category tag 
        }

        logging.info("BigQuery load operation successful", extra=bq_fields)

This code shows how to extract BigQuery job stats, among them execution time, bytes processed, rows affected, and destination table. You can also add other metadata like we do, such as a custom event type, status, and category.

Any call to logging (at any level at or above the configured threshold, in this case INFO via logging.getLogger().setLevel(logging.INFO)) will create a log record that is exported to Elastic. This means that Python scripts that already use logging do not need any changes to export their logs to Elastic.
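As a minimal illustration (the logger setup below is an assumption about a typical script, not code taken from our ETLs), a script that already logs like this needs no changes once it runs under opentelemetry-instrument with the logging variables set as above:

import logging

logging.getLogger().setLevel(logging.INFO)

# Exported to Elastic: INFO is at the configured threshold
logging.info("Starting nightly export", extra={"category": "example"})

# Not exported: DEBUG is below the INFO threshold
logging.debug("Row-level details that stay local")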

5 - APM Logs view

For each of the log messages, you can go into the details view (click the icon that appears when you hover over the log line and go into View details) to examine the metadata attached to the log message. You can also explore the logs in Discover.

Explanation of Logging Modification

  • logging.info: This logs an informational message. The message "BigQuery load operation successful" is logged.

  • extra=bq_fields: This adds additional context to the log entry using the bq_fields dictionary. This context can include details making the log entries more informative and easier to analyze. This data will be later used to set up alerts and data anomaly detection jobs.

Monitoring in Elastic's APM

As shown, we can examine traces, metrics, and logs in the APM interface. To make the most out of this data, we rely on nearly the whole suite of features in Elastic Observability, alongside Elastic's machine learning capabilities.

Rules and Alerts

We can set up rules and alerts to detect anomalies, errors, and performance issues in our scripts.

The error count threshold rule is used to create a trigger when the number of errors in a service exceeds a defined threshold.

To create the rule, go to Alerts and Insights -> Rules -> Create Rule -> Error count threshold, set the error count threshold, choose the service or environment you want to monitor (you can also set an error grouping key across services), define how often to run the check, and choose a connector.

6 - ETL Status Error Rule

Next, we create a rule of type custom threshold on a given ETL logs data view (create one for your index), filtering on "labels.status: error" to get all the logs with status error from any of the steps of the ETL that have failed. The rule condition is set to document count > 0. In our case, in the last section of the rule config, we also set up Slack alerts every time the rule is activated. You can pick from a long list of connectors Elastic supports.

7 - ETL Status Error Rule

Then we can set up alerts for failures. We add status to the logs metadata as shown in the code sample below for each of the steps in the ETLs. It then becomes available in ES via labels.status.

logging.info(
    "Elasticsearch search operation successful",
    extra={
        "event": "Elasticsearch Search",
        "status": "success",
        "category": category,
        "index": index,
    },
)

More Rules

We could also add rules to detect anomalies in the execution time of the different spans we define. This is done by selecting transaction/span -> Alerts and rules -> Custom threshold rule -> Latency. In the example below, we want to generate an alert whenever the search step takes more than 25s.

8 - APM Custom Threshold - Latency

9 - APM Custom Threshold - Config

Alternatively, for finer-grained control, you can go with Alerts and rules -> Anomaly rule, set up an anomaly job, and pick a threshold severity level.

10 - APM Anomaly Rule - Config

Anomaly detection job

In this example, we set up an anomaly detection job on the number of documents before the transform using the [Single metric job](https://www.elastic.co/guide/en/machine-learning/current/ml-anomaly-detection-job-types.html#multi-metric-jobs) to detect any anomalies in the incoming data source.

11 - Single Metrics

In the last step, you can create alerting similarly to what we did before, setting a severity level threshold so that you receive alerts whenever an anomaly is detected. Every anomaly is assigned an anomaly score, which characterizes it with a severity level.

12 - Anomaly detection Alerting - Severity

Similarly to the previous example, we set up a Slack connector to receive alerts whenever an anomaly is detected.

13 - Anomaly detection Alerting - Connectors

You can add the job results to your custom dashboard by going to Add Panel -> ML -> Anomaly Swim Lane and picking your job.

We also add jobs for the number of documents after the transform, and a Multi-metric job on execution_time_ms, bytes_processed, and rows_affected, following the same approach as in Monitor your DBT pipelines with Elastic Observability.

Custom Dashboard

Now that your logs, metrics, and traces are in Elastic, you can use the full potential of our Kibana dashboards to extract the most from them. We can create a custom dashboard like the following one: a pie chart based on labels.event (category field for every type of step in the ETLs), a chart for every type of step broken down by status, a timeline of steps broken down by status, BQ stats for the ETL, and anomaly detection swim lane panels for the various anomaly jobs.

14 - Custom Dashboard

Conclusion

Elastic’s APM, in combination with other Observability and ML features, provides a unified view of our data pipelines, allowing us to bring a lot of value with minimal code changes:

  • Collect existing logs (no need to add custom logging) alongside their execution context
  • Monitor the runtime behavior of our models
  • Track data quality issues
  • Identify and troubleshoot real-time incidents
  • Optimize performance bottlenecks and resource usage
  • Identify dependencies on other services and their latency
  • Optimize data transformation processes
  • Set up alerts on latency, data quality issues, error rates of transactions, or CPU usage

With these capabilities, we can ensure the resilience and reliability of our data pipelines, leading to more robust and accurate BI systems and reporting.

In conclusion, setting up OpenTelemetry (OTEL) in Python for data pipeline observability has significantly improved our ability to monitor, detect, and resolve issues proactively. This has led to more reliable data transformations, better resource management, and enhanced overall performance of our data transformation, BI and Machine Learning systems.

]]>
<![CDATA[NGINX log analytics with GenAI in Elastic]]> https://www.elastic.co/observability-labs/blog/nginx-log-analytics-with-genai-elastic nginx-log-analytics-with-genai-elastic Fri, 05 Jul 2024 00:00:00 GMT Elastic Observability provides a full observability solution, supporting metrics, traces, and logs for applications and infrastructure. NGINX, which is widely used for web serving, load balancing, HTTP caching, and reverse proxying, is key to many applications and outputs a large volume of logs. NGINX's access logs, which detail all requests made to the NGINX server, and error logs, which record server-related issues and problems, are key to managing and analyzing NGINX issues and understanding what is happening to your application.

In managing NGINX, Elastic provides several capabilities:

  1. Easy ingest, parsing, and out-of-the-box dashboards. Check out the simple how-to in our docs. Based on logs, these dashboards show several items over time: response codes, errors, top pages, data volume, browsers used, active connections, drop rates, and much more.

  2. Out-of-the-box ML-based anomaly detection jobs for your NGINX logs. These jobs help pinpoint anomalies against request rates, IP address request rates, URL access, status codes, and visitor rate anomalies.

  3. ES|QL which helps work through logs and build out charts during analysis.

  4. Elastic’s GenAI Assistant provides a simple natural language interface that helps analyze all the logs and can pull out issues from ML jobs and even create dashboards. The Elastic AI Assistant also automatically uses ES|QL.

  5. NGINX SLOs - Finally Elastic provides the ability to define and monitor SLOs for your NGINX logs. While most SLOs are metrics-based, Elastic allows you to create logs-based SLOs. We detailed this in a previous blog.

NGINX logs are another example of why logs are great. Logging is an important part of observability, alongside metrics and tracing. However, the volume of logs that an application and the underlying infrastructure output can be daunting, and NGINX is usually the starting point for most analyses.

In today’s blog, we’ll cover how the out-of-the-box ML-based anomaly detection jobs can help with root cause analysis (RCA), and how Elastic’s GenAI Assistant helps you easily work through logs to pinpoint issues in minutes.

Prerequisites and config<a id="prerequisites-and-config"></a>

If you plan on following this blog, here are some of the components and details we used to set up this demonstration:

  • Ensure you have an account on Elastic Cloud and a deployed stack (see instructions here).

  • Bring up an NGINX server on a host, or run an application with NGINX as a front end and drive traffic to it.

  • Install the NGINX integration and assets and review the dashboards as noted in the docs.

  • Ensure you have an ML node configured in your Elastic stack

  • To use the AI Assistant you will need a trial or upgrade to Platinum.

In our scenario, we use three months of data from our Elastic environment to help highlight the features. Hence you might need to run your application with traffic for a similar time frame to follow along.

Analyzing the issues with AI Assistant<a id="analyzing-the-issues-with-ai-assistant"></a>

As detailed in a previous blog, you can get alerted on issues via SLO monitoring against NGINX logs. Let’s assume you have an SLO based on status codes as we outlined in the previous blog. You can immediately analyze the issue via the AI Assistant. Because it's a chat interface, we simply open the AI Assistant and work through some simple analysis (see the animated GIF for a demo):

AI Assistant analysis:<a id="ai-assistant-analysis"></a>

  • Using lens graph all http response status codes < 400 and > =400 from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer - We wanted to simply understand the number of requests resulting in status code >= 400 and graph the results. We see that 15% of the requests were not successful, hence an SLO alert being triggered.

  • Which ip address (field source.adress) has the highest number of http.response.status.code >= 400 from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer - We were curious if there was a specific IP address with a high number of unsuccessful requests. 72.57.0.53, with a count of 25,227 occurrences, stands out, but that alone does not account for all of the failed requests.

  • What country (source.geo.country_iso_code) is source.address=72.57.0.53 coming from. Use filebeat-nginx-elasticco-anon-2017. - Again we were curious if this came from a specific country. And the IP address 72.57.0.53 is coming from the country with the ISO code IN, which corresponds to India. Nothing out of the ordinary.

  • Did source.address=72.57.0.53 have any (http.response.status.code < 400) from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer - Oddly, the IP address in question only had 4,000+ successful responses, meaning it's not malicious and points to something else.

  • What are the different status codes (http.response.status.code>=400), from source.address=72.57.0.53. Use filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer. Provide counts for each status code - We were curious whether there were any 502s; there were none, and most of the failures were 404s.

  • What are the different status codes (http.response.status.code>=400). Use filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer. Provide counts for each status code - Regardless of any specific address, which status codes >= 400 occur most often? This also points to 404.

  • What does a high 404 count from a specific IP address mean from NGINX logs? - With this question, we wanted to understand the potential causes from our application. From the answers, we can rule out security probing and web scraping, as we validated that the specific address 72.57.0.53 also has successful requests. It also rules out user error. Hence this potentially points to broken links or missing resources.

Watch the flow:<a id="watch-the-flow"></a>

<Video vidyardUuid="ak9xDdhcL3SxpqU7CRsD68" />

Potential issue:

It seems that we potentially have an issue with the backend serving specific responses, or with resources (a database issue or broken links). This is causing the higher-than-normal number of non-successful status codes (>=400).

Key highlights from AI Assistant:

As you watch this video, you will notice a few things:

  1. We analyzed millions of logs in a matter of minutes using a set of simple natural language queries. 

  2. We didn’t need to know any special query language. The AI Assistant used Elastic’s ES|QL but can similarly use KQL.

  3. The AI Assistant easily builds out graphs

  4. The AI Assistant is accessing and using internal information stored in Elastic’s indices, rather than acting as a simple web-search-based assistant. This is enabled through RAG, and the AI Assistant can also bring up known issues in GitHub, runbooks, and other useful internal information.

Check out the following blog on how the AI Assistant uses RAG to retrieve internal information, specifically from GitHub and runbooks.

Locating anomalies with ML

While the AI Assistant is great for analyzing information, another important aspect of NGINX log management is to ensure you can manage log spikes and anomalies. Elastic has a machine learning platform that allows you to develop jobs that analyze one or more metrics to look for anomalies. When using NGINX, there are several out-of-the-box anomaly detection jobs. These work specifically on NGINX access logs:

  • Low_request_rate_nginx - Detect low request rates

  • Source_ip_request_rate_nginx - Detect unusual source IPs - high request rates

  • Source_ip_url_count_nginx - Detect unusual source IPs - high distinct count of URLs

  • Status_code_rate_nginx - Detect unusual status code rates

  • Visitor_rate_nginx - Detect unusual visitor rates

Since these jobs come right out of the box, let's look at the one related to our previous analysis: Status_code_rate_nginx.

NGINX ML Log Analytics

With a few simple clicks, we immediately get an analysis showing a specific IP address, 72.57.0.53, having higher-than-normal non-successful requests. This is also what we found using the AI Assistant.

We can take this further with conversations with the AI Assistant, look at the logs, and/or even look at the other ML anomaly jobs.

Conclusion:<a id="conclusion"></a>

You’ve now seen how easily Elastic’s RAG-based AI Assistant can help analyze NGINX logs without the need to know the query syntax, where the data lives, or even the fields. Additionally, you’ve seen how we can alert you when there is a potential issue or degradation in service (via SLOs).

Check out other resources on NGINX logs:

Out-of-the-box anomaly detection jobs for NGINX

Using the NGINX integration to ingest and analyze NGINX Logs

NGINX Logs based SLOs in Elastic

Using GitHub issues, runbooks, and other internal information for RCAs with Elastic’s RAG based AI Assistant

Try it out<a id="try-it-out"></a>

Existing Elastic Cloud customers can access many of these features directly from the Elastic Cloud console. Not taking advantage of Elastic on the cloud? Start a free trial.

All of this is also possible in your environment. Learn how to get started today.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.

Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.

]]>
<![CDATA[Root cause analysis with logs: Elastic Observability's AIOps Labs]]> https://www.elastic.co/observability-labs/blog/observability-logs-machine-learning-aiops observability-logs-machine-learning-aiops Thu, 27 Apr 2023 00:00:00 GMT In the previous blog in our root cause analysis with logs series, we explored how to analyze logs in Elastic Observability with Elastic’s anomaly detection and log categorization capabilities. Elastic’s platform enables you to get started on machine learning (ML) quickly. You don’t need to have a data science team or design a system architecture. Additionally, there’s no need to move data to a third-party framework for model training.

Preconfigured machine learning models for observability and security are available. If those don't work well enough on your data, in-tool wizards guide you through the few steps needed to configure custom anomaly detection and train your model with supervised learning. To get you started, there are several key features built into Elastic Observability to aid in analysis, bypassing the need to run specific ML models. These features help minimize the time and analysis of logs.

Let’s review the set of machine learning-based observability features in Elastic:

Anomaly detection: Elastic Observability, when turned on (see documentation), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.

Log categorization: Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped, based on their messages and formats, so that you can take action more quickly.

High-latency or erroneous transactions: Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes. Read APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions for an overview of this capability.

AIOps Labs: AIOps Labs provides two main capabilities using advanced statistical methods:

  • Log spike detector helps identify reasons for increases in log rates. It makes it easy to find and investigate the causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.
  • Log pattern analysis helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.

As we showed in the last blog, using machine learning-based features helps minimize the extremely tedious and time-consuming process of analyzing data using traditional methods, such as alerting and simple pattern matching (visual or simple searching, etc.). Trying to find the needle in the haystack requires the use of some level of artificial intelligence due to the increasing amounts of telemetry data (logs, metrics, and traces) being collected across ever-growing applications.

In this blog post, we’ll cover two capabilities found in Elastic’s AIOps Labs: log spike detector and log pattern analysis. We’ll use the same data from the previous blog and analyze it using these two capabilities.

_We will cover log spike detector and log pattern analysis against the popular Hipster Shop app developed by Google and modified recently by OpenTelemetry._

Overviews of high-latency capabilities can be found here, and an overview of AIOps labs can be found here.

Below, we will examine a scenario where we use anomaly detection and log categorization to help identify a root cause of an issue in Hipster Shop.

Prerequisites and config

If you plan on following this blog, here are some of the components and details we used to set up this demonstration:

Once you’ve instrumented your application with APM (Elastic or OTel) agents and are ingesting metrics and logs into Elastic Observability, you should see a service map for the application as follows:

observability service map

In our example, we’ve introduced issues to help walk you through the root cause analysis features. You might have a different set of issues depending on how you load the application and/or introduce specific feature flags.

As part of the walk-through, we’ll assume we are DevOps or SRE managing this application in production.

Root cause analysis

While the application has been running normally for some time, you get a notification that some of the services are unhealthy. This can occur from the notification setting you’ve set up in Elastic or other external notification platforms (including customer-related issues). In this instance, we’re assuming that customer support has called in multiple customer complaints about the website.

How do you as a DevOps or SRE investigate this? We will walk through two avenues in Elastic to investigate the issue:

  • Log spike analysis
  • Log pattern analysis

While we show these two paths separately, they can be used in conjunction and are complementary, as they are both tools Elastic Observability provides to help you troubleshoot and identify a root cause.

Starting with the service map, you can see anomalies identified with red circles and as we select them, Elastic will provide a score for the anomaly.

observability service map service details

In this example, we can see that there is a score of 96 for a specific anomaly for the productCatalogService in the Hipster Shop application. An anomaly score indicates the significance of the anomaly compared to previously seen anomalies. Rather than jump into anomaly detection (see previous blog), let’s look at some of the potential issues by reviewing the service details in APM.

observability product catalog service overview

What we see for the productCatalogService is that there are latency issues, failed transactions, a large number of errors, and a dependency on PostgreSQL. When we look at the errors in more detail and drill down, we see they are all coming from pq, which is a PostgreSQL driver for Go.

observability product catalog service errors

As we drill further, we still can’t tell why the productCatalogService is not able to pull information from the PostgreSQL database.

observability product catalog service error group

We see that there is a spike in errors, so let's see if we can glean further insight using one of our two options:

  • Log rate spikes
  • Log pattern analysis

Log rate spikes

Let’s start with the log rate spikes detector capability, found in the AIOps Labs section of Elastic’s machine learning capabilities. We also pre-select analyzing the spike against a baseline history.

explain log rate spikes postgres

The log rate spikes detector has looked at all the logs from the spike and compared them to the baseline, and it's seeing higher-than-normal counts in specific log messages. From a visual inspection, we see that PostgreSQL log messages are high. We further filter this with postgres.

explain log rates spikes pgbench

We immediately notice that this issue is potentially caused by pgbench, a popular PostgreSQL tool to help benchmark the database. pgbench runs the same sequence of SQL commands over and over, possibly in multiple, concurrent database sessions. While pgbench is definitely a useful tool, it should not be used in a production environment as it causes a heavy load on the database host, likely causing higher latency issues on the site.

While this may or may not be the ultimate root cause, we have rather quickly identified a potential issue that has a high probability of being the root cause. An engineer likely intended to run pgbench against a staging database to evaluate its performance, and not the production environment.

Log pattern analysis

Instead of log rate spikes, let’s use log pattern analysis to investigate the spike in errors we saw in productCatalogService. In AIOps Labs, we simply select Log Pattern Analysis, use Logs data, filter the results with postgres (since we know it's related to PostgreSQL), and look at information from the message field of the logs we are processing. We see the following:

observability explain log pattern analysis

Almost immediately we see that the biggest pattern it finds is a log message where pgbench is updating the database. From log pattern analysis, we can drill directly into this log message in Discover to review the details and analyze the messages further.

expanded document

As we mentioned in the previous section, while it may or may not be the root cause, it quickly gives us a place to start and a potential root cause. A developer likely intended to run pgbench against a staging database to evaluate its performance, and not the production environment.

Conclusion

Between the first blog and this one, we’ve shown how Elastic Observability can help you further identify and get closer to pinpointing the root cause of issues without having to look for a “needle in a haystack.” Here’s a quick recap of what you learned in this blog.

  • Elastic Observability has numerous capabilities to help you reduce your time to find the root cause and improve your MTTR (even MTTD). In particular, we reviewed the following two main capabilities (found in AIOps Labs in Elastic) in this blog:

    1. Log rate spikes detector helps identify reasons for increases in log rates. It makes it easy to find and investigate the causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.
    2. Log pattern analysis helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.
  • You learned how easy and simple it is to use Elastic Observability’s log categorization and anomaly detection capabilities without having to understand machine learning (which helps drive these features) or having to do any lengthy setups.

Ready to get started? Register for Elastic Cloud and try out the features and capabilities outlined above.

Additional logging resources:

Common use case examples with logs:

Elastic and Elasticsearch are trademarks, logos or registered trademarks of Elasticsearch B.V. in the United States and other countries.

]]>
<![CDATA[Monitoring service performance: An overview of SLA calculation for Elastic Observability]]> https://www.elastic.co/observability-labs/blog/observability-sla-calculations-transforms observability-sla-calculations-transforms Mon, 24 Apr 2023 00:00:00 GMT Elastic Stack provides many valuable insights for different users. Developers are interested in low-level metrics and debugging information. SREs are interested in seeing everything at once and identifying where the root cause is. Managers want reports that tell them how good service performance is and if the service level agreement (SLA) is met. In this post, we’ll focus on the service perspective and provide an overview of calculating an SLA.

Since version 8.8, we have built-in functionality to calculate SLOs — check out our guide!

Foundations of calculating an SLA

There are many ways to calculate and measure an SLA. The most important part is the definition of the SLA, and as a consultant, I’ve seen many different ways. Some examples include:

  • Count of HTTP 2xx must be above 98% of all HTTP status
  • Response time of successful HTTP 2xx requests must be below x milliseconds
  • Synthetic monitor must be up at least 99%
  • 95% of all batch transactions from the billing service need to complete within 4 seconds

Depending on the origin of the data, calculating the SLA can be easier or more difficult. For uptime (Synthetic Monitoring), we automatically provide SLA values and offer out-of-the-box alerts, so you can simply define an alert when availability is below 98% for the last hour.

overview monitor details

I personally recommend using Elastic Synthetic Monitoring whenever possible to monitor service performance. Running HTTP requests and verifying the answers from the service, or doing fully fledged browser monitors and clicking through the website as a real user does, ensures a better understanding of the health of your service.

Sometimes this is not possible, for example when you want to calculate the uptime of a specific Windows service that does not offer any TCP port or HTTP interaction. Here the caveat applies that just because the service is running, it does not necessarily mean that the service is working fine.

Transforms to the rescue

We have identified our important service. In our case, it is the Steam Client Helper. There are two ways to solve this.

Lens formula

You can use Lens and a formula (for a deep dive into formulas, check out this blog). Use the search bar to filter down to the data you want, then use the formula option in Lens. We count all records with Running as the state and divide by the overall count of records. This is a nice solution when there is a need to calculate quickly and on the fly.

count(kql='windows.service.state: "Running" ')/count()

Using the formula posted above as the bar chart's vertical axis calculates the uptime percentage. We use an annotation to mark why there is a dip and why this service was below the threshold. The annotation is set to reboot, which indicates a reboot happened and thus the service was down for a moment. Lastly, we add a reference line set to our defined threshold of 98%. This ensures that a quick look at the visualization allows our eyes to gauge whether we are above or below the threshold.

visualization

Transform

What if you are not interested in just one service, but multiple services are needed for your SLA? That is a problem Transforms can solve. A second issue is that the data computed by the Lens formula is only available inside that visualization, so we cannot create any alerts on it.

Go to Transforms and create a pivot transform.

  1. Add the following filter to narrow it down to only the services data set: data_stream.dataset: "windows.service". If you are interested in a specific service, you can always add it to the search bar, for example to know whether a specific remote management service is up across your entire fleet!

  2. Select date_histogram(@timestamp) and set it to your chosen unit. By default, the Elastic Agent only collects service states every 60 seconds. I am going with 1 hour.

  3. Select agent.name and windows.service.name as well.

transform configuration

  4. Now we need to define an aggregation type. We will use a value_count of windows.service.state, which simply counts how many records have this value.

aggregations

  5. Rename the value_count to total_count.

  6. Add value_count for windows.service.state a second time and use the pencil icon to edit it so that it filters on the term Running, naming it running.

aggregations apply

  7. This opens up a sub-aggregation. Once again, select value_count(windows.service.state) and rename it to values.

  8. Now, the preview shows us the count of records with any state and the count of running.

transform configuration

  9. Here comes the tricky part. We need to write a custom aggregation to calculate the percentage of uptime. Click on the copy icon next to Edit JSON config.

  10. In a new tab, go to Dev Tools and paste what you have in the clipboard.

  11. Press the play button or use the keyboard shortcut ctrl+enter/cmd+enter to run it. This will create a preview of what the data looks like. It should give you the same information as the table preview.

  12. Now we need to calculate the percentage of uptime, which means adding a bucket script where we divide running.values by total_count, just like we did in the Lens visualization. If you name the columns differently or use more than a single value, you will need to adapt accordingly.

"availability": {
        "bucket_script": {
          "buckets_path": {
            "up": "running>values",
            "total": "total_count"
          },
          "script": "params.up/params.total"
        }
      }
  13. This is the entire transform for me:
POST _transform/_preview
{
  "source": {
    "index": [
      "metrics-*"
    ]
  },
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1h"
        }
      },
      "agent.name": {
        "terms": {
          "field": "agent.name"
        }
      },
      "windows.service.name": {
        "terms": {
          "field": "windows.service.name"
        }
      }
    },
    "aggregations": {
      "total_count": {
        "value_count": {
          "field": "windows.service.state"
        }
      },
      "running": {
        "filter": {
          "term": {
            "windows.service.state": "Running"
          }
        },
        "aggs": {
          "values": {
            "value_count": {
              "field": "windows.service.state"
            }
          }
        }
      },
      "availability": {
        "bucket_script": {
          "buckets_path": {
            "up": "running>values",
            "total": "total_count"
          },
          "script": "params.up/params.total"
        }
      }
    }
  }
}
  14. The preview in Dev Tools should work and be complete. Otherwise, you must debug any errors. Most of the time, the problem is in the bucket script and the path to the values; you might have called it up instead of running. This is what the preview looks like for me.
{
  "running": {
    "values": 1
  },
  "agent": {
    "name": "AnnalenasMac"
  },
  "@timestamp": "2021-12-07T19:00:00.000Z",
  "total_count": 1,
  "availability": 1,
  "windows": {
    "service": {
      "name": "InstallService"
    }
  }
},
  15. Now we just paste the bucket script into the transform creation UI after selecting Edit JSON config. It looks like this:

transform configuration pivot configuration object

  16. Give your transform a name, set the destination index, and run it continuously. When selecting this, please also make sure not to use @timestamp; instead, opt for event.ingested. Our documentation explains this in detail.

transform details

  17. Click Next, then Create and start. This can take a bit, so don't worry. (If you prefer to create the transform through the API instead of the UI, see the sketch below.)
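Here is a hedged sketch of creating the same transform programmatically with the elasticsearch Python client. The connection details, transform id, and delay are illustrative assumptions; the pivot mirrors the preview above, and the destination index matches the windows-service data view used later in this post:

from elasticsearch import Elasticsearch

# Connection details are illustrative; point this at your own cluster
es = Elasticsearch("https://localhost:9200", api_key="YOUR_API_KEY")

pivot = {
    "group_by": {
        "@timestamp": {"date_histogram": {"field": "@timestamp", "calendar_interval": "1h"}},
        "agent.name": {"terms": {"field": "agent.name"}},
        "windows.service.name": {"terms": {"field": "windows.service.name"}},
    },
    "aggregations": {
        "total_count": {"value_count": {"field": "windows.service.state"}},
        "running": {
            "filter": {"term": {"windows.service.state": "Running"}},
            "aggs": {"values": {"value_count": {"field": "windows.service.state"}}},
        },
        "availability": {
            "bucket_script": {
                "buckets_path": {"up": "running>values", "total": "total_count"},
                "script": "params.up/params.total",
            }
        },
    },
}

es.transform.put_transform(
    transform_id="windows-service-availability",  # illustrative name
    source={"index": ["metrics-*"]},
    dest={"index": "windows-service"},
    pivot=pivot,
    frequency="1m",
    # Continuous mode synced on event.ingested, not @timestamp
    sync={"time": {"field": "event.ingested", "delay": "60s"}},
)
es.transform.start_transform(transform_id="windows-service-availability")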

To summarize, we have now created a pivot transform using a bucket script aggregation to calculate the running time of a service as a percentage. There is a caveat: by default, Elastic Agent only collects the service state every 60 seconds. It can happen that a service is up exactly when the state is collected and down a few seconds later. If the service is that important and no other monitoring options, such as Elastic Synthetics, are possible, you might want to reduce the collection interval on the Agent side to get the service state every 30 or 45 seconds. Depending on how important your thresholds are, you can create multiple policies with different collection intervals. A super important server might collect the service state every 10 seconds because you need as much granularity and assurance of the metric's correctness as possible. For normal workstations, where you just want to know if your remote access solution is up the majority of the time, you might not mind having a single metric every 60 seconds.

After you have created the transform, one additional benefit is that the data is stored in an index in Elasticsearch. When you only build the visualization, the metric is calculated for that visualization and is not available anywhere else. Since the transform output is now data, you can create a threshold alert to your favorite connector (Slack, Teams, ServiceNow, mail, and many more to choose from).

Visualizing the transformed data

The transform created a data view called windows-service. The first thing we want to do is change the format of the availability field to a percentage. This automatically tells Lens that the field needs to be formatted as a percentage, so you don't need to select it manually or do extra calculations. Furthermore, in Discover, instead of seeing 0.5 you see 50%. Isn't that cool? This is also possible for durations, like event.duration if you have it in nanoseconds! No more calculations on the fly and wondering if you need to divide by 1,000 or 1,000,000.

edit field availability

We get this view by using a simple Lens visualization with a timestamp on the vertical axis, a minimum interval of 1 day, and an average of availability. Don’t worry — the other data will be populated once the transform finishes. We can add a reference line using the value 0.98 because our target is 98% uptime of the service.

line

Summary

This blog post covered the steps needed to calculate the SLA for a specific data set in Elastic Observability, as well as how to visualize it. Using this calculation method opens the door to a lot of interesting use cases: you can change the bucket script and start calculating the number of sales or the average basket size. Interested in learning more about Elastic Synthetics? Read our documentation or check out our free Synthetic Monitoring Quick Start training.

]]>
<![CDATA[Collecting OpenShift container logs using Red Hat’s OpenShift Logging Operator]]> https://www.elastic.co/observability-labs/blog/openshift-container-logs-red-hat-logging-operator openshift-container-logs-red-hat-logging-operator Tue, 16 Jan 2024 00:00:00 GMT This blog explores a possible approach to collecting and formatting OpenShift Container Platform logs and audit logs with Red Hat OpenShift Logging Operator. We recommend using Elastic® Agent for the best possible experience! We will also show how to format the logs to Elastic Common Schema (ECS) for the best experience viewing, searching, and visualizing your logs. All examples in this blog are based on OpenShift 4.14.

Why use OpenShift Logging Operator?

A lot of enterprise customers use OpenShift as their orchestrating solution. The advantages of this approach are:

  • It is developed and supported by Red Hat

  • It can automatically update the OpenShift cluster along with the Operating system to make sure that they are and remain compatible

  • It can speed up development life cycles with features like source-to-image

  • It uses enhanced security

In our consulting experience, this latter aspect poses challenges and friction with OpenShift administrators when we try to install Elastic Agent to collect the logs of the pods. Indeed, Elastic Agent requires the files of the host to be mounted in the pod, and it also needs to run in privileged mode. (Read more about the permissions required by Elastic Agent in the official Elasticsearch® documentation.) While the solution we explore in this post requires similar privileges under the hood, it is managed by the OpenShift Logging Operator, which is developed and supported by Red Hat.

Which logs are we going to collect?

In OpenShift Container Platform, we distinguish three broad categories of logs: audit, application, and infrastructure logs:

  • Audit logs describe the list of activities that affected the system by users, administrators, and other components.

  • Application logs are composed of the container logs of the pods running in non-reserved namespaces.

  • Infrastructure logs are composed of container logs of the pods running in reserved namespaces like openshift*, kube*, and default along with journald messages from the nodes.

In the following, we will consider only audit and application logs for the sake of simplicity, and we will describe how to format them into the shape expected by the Kubernetes integration to get the most out of Elastic Observability.

Getting started

To collect the logs from OpenShift, we must perform some preparation steps in Elasticsearch and OpenShift.

Inside Elasticsearch

We first install the Kubernetes integration assets. We are mainly interested in the index templates and ingest pipelines for the logs-kubernetes.container_logs and logs-kubernetes.audit_logs.

To format the logs received from the ClusterLogForwarder in ECS format, we will define a pipeline to normalize the container logs. The field naming convention used by OpenShift is slightly different from that used by ECS. To get a list of exported fields from OpenShift, refer to Exported fields | Logging | OpenShift Container Platform 4.14. To get a list of exported fields of the Kubernetes integration, you can refer to Kubernetes fields | Filebeat Reference [8.11] | Elastic and Logs app fields | Elastic Observability [8.11]. Further, specific fields like kubernetes.annotations must be normalized by replacing dots with underscores. This operation is usually done automatically by Elastic Agent.

PUT _ingest/pipeline/openshift-2-ecs
{
  "processors": [
    {
      "rename": {
        "field": "kubernetes.pod_id",
        "target_field": "kubernetes.pod.uid",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.pod_ip",
        "target_field": "kubernetes.pod.ip",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.pod_name",
        "target_field": "kubernetes.pod.name",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.namespace_name",
        "target_field": "kubernetes.namespace",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.namespace_id",
        "target_field": "kubernetes.namespace_uid",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.container_id",
        "target_field": "container.id",
        "ignore_missing": true
      }
    },
    {
      "dissect": {
        "field": "container.id",
        "pattern": "%{container.runtime}://%{container.id}",
        "ignore_failure": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.container_image",
        "target_field": "container.image.name",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "field": "kubernetes.container.image",
        "copy_from": "container.image.name",
        "ignore_failure": true
      }
    },
    {
      "set": {
        "copy_from": "kubernetes.container_name",
        "field": "container.name",
        "ignore_failure": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.container_name",
        "target_field": "kubernetes.container.name",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "field": "kubernetes.node.name",
        "copy_from": "hostname",
        "ignore_failure": true
      }
    },
    {
      "rename": {
        "field": "hostname",
        "target_field": "host.name",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "level",
        "target_field": "log.level",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "file",
        "target_field": "log.file.path",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "copy_from": "openshift.cluster_id",
        "field": "orchestrator.cluster.name",
        "ignore_failure": true
      }
    },
    {
      "dissect": {
        "field": "kubernetes.pod_owner",
        "pattern": "%{_tmp.parent_type}/%{_tmp.parent_name}",
        "ignore_missing": true
      }
    },
    {
      "lowercase": {
        "field": "_tmp.parent_type",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "field": "kubernetes.pod.{{_tmp.parent_type}}.name",
        "value": "{{_tmp.parent_name}}",
        "if": "ctx?._tmp?.parent_type != null",
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "field": [
          "_tmp",
          "kubernetes.pod_owner"
          ],
          "ignore_missing": true
      }
    },
    {
      "script": {
        "description": "Normalize kubernetes annotations",
        "if": "ctx?.kubernetes?.annotations != null",
        "source": """
        def keys = new ArrayList(ctx.kubernetes.annotations.keySet());
        for(k in keys) {
          if (k.indexOf(".") >= 0) {
            def sanitizedKey = k.replace(".", "_");
            ctx.kubernetes.annotations[sanitizedKey] = ctx.kubernetes.annotations[k];
            ctx.kubernetes.annotations.remove(k);
          }
        }
        """
      }
    },
    {
      "script": {
        "description": "Normalize kubernetes namespace_labels",
        "if": "ctx?.kubernetes?.namespace_labels != null",
        "source": """
        def keys = new ArrayList(ctx.kubernetes.namespace_labels.keySet());
        for(k in keys) {
          if (k.indexOf(".") >= 0) {
            def sanitizedKey = k.replace(".", "_");
            ctx.kubernetes.namespace_labels[sanitizedKey] = ctx.kubernetes.namespace_labels[k];
            ctx.kubernetes.namespace_labels.remove(k);
          }
        }
        """
      }
    },
    {
      "script": {
        "description": "Normalize special Kubernetes Labels used in logs-kubernetes.container_logs to determine service.name and service.version",
        "if": "ctx?.kubernetes?.labels != null",
        "source": """
        def keys = new ArrayList(ctx.kubernetes.labels.keySet());
        for(k in keys) {
          if (k.startsWith("app_kubernetes_io_component_")) {
            def sanitizedKey = k.replace("app_kubernetes_io_component_", "app_kubernetes_io_component/");
            ctx.kubernetes.labels[sanitizedKey] = ctx.kubernetes.labels[k];
            ctx.kubernetes.labels.remove(k);
          }
        }
        """
      }
    }
    ]
}
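Before wiring anything up, you can sanity-check the pipeline with the simulate API. The document below is a minimal, hypothetical example containing only a few of the OpenShift fields the pipeline handles; the response should show the renamed ECS fields (kubernetes.pod.name, container.id, host.name, and so on):

POST _ingest/pipeline/openshift-2-ecs/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "example container log line",
        "hostname": "worker-0",
        "level": "info",
        "kubernetes": {
          "pod_name": "my-app-7d9f8c6b5-abcde",
          "namespace_name": "my-namespace",
          "container_name": "my-app",
          "container_id": "cri-o://0123456789abcdef",
          "pod_owner": "ReplicaSet/my-app-7d9f8c6b5",
          "annotations": {
            "openshift.io/scc": "restricted"
          }
        }
      }
    }
  ]
}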

Similarly, to handle the audit logs in the same way as those collected by the Kubernetes integration, we define an ingest pipeline:

PUT _ingest/pipeline/openshift-audit-2-ecs
{
  "processors": [
    {
      "script": {
        "source": """
        def audit = [:];
        def keyToRemove = [];
        for(k in ctx.keySet()) {
          if (k.indexOf('_') != 0 && !['@timestamp', 'data_stream', 'openshift', 'event', 'hostname'].contains(k)) {
            audit[k] = ctx[k];
            keyToRemove.add(k);
          }
        }
        for(k in keyToRemove) {
          ctx.remove(k);
        }
        ctx.kubernetes=["audit":audit];
        """,
        "description": "Move all the 'kubernetes.audit' fields under 'kubernetes.audit' object"
      }
    },
    {
      "set": {
        "copy_from": "openshift.cluster_id",
        "field": "orchestrator.cluster.name",
        "ignore_failure": true
      }
    },
    {
      "set": {
        "field": "kubernetes.node.name",
        "copy_from": "hostname",
        "ignore_failure": true
      }
    },
    {
      "rename": {
        "field": "hostname",
        "target_field": "host.name",
        "ignore_missing": true
      }
    },
    {
      "script": {
        "if": "ctx?.kubernetes?.audit?.annotations != null",
        "source": """
          def keys = new ArrayList(ctx.kubernetes.audit.annotations.keySet());
          for(k in keys) {
            if (k.indexOf(".") >= 0) {
              def sanitizedKey = k.replace(".", "_");
              ctx.kubernetes.audit.annotations[sanitizedKey] = ctx.kubernetes.audit.annotations[k];
              ctx.kubernetes.audit.annotations.remove(k);
            }
          }
          """,
        "description": "Normalize kubernetes audit annotations field as expected by the Integration"
      }
    }
  ]
}

The main objective of the pipeline is to mimic what Elastic Agent is doing: storing all audit fields under the kubernetes.audit object.
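Again, a quick simulation illustrates the behavior. The audit fields in this hypothetical document (verb, user, objectRef) are illustrative only; any top-level field other than the excluded ones ends up under kubernetes.audit:

POST _ingest/pipeline/openshift-audit-2-ecs/_simulate
{
  "docs": [
    {
      "_source": {
        "@timestamp": "2024-01-01T00:00:00.000Z",
        "hostname": "master-0",
        "verb": "create",
        "user": {
          "username": "system:admin"
        },
        "objectRef": {
          "resource": "pods",
          "namespace": "default"
        },
        "annotations": {
          "authorization.k8s.io/decision": "allow"
        }
      }
    }
  ]
}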

We are not going to use the conventional @custom pipeline approach because the fields must be normalized before invoking the logs-kubernetes.container_logs integration pipeline that uses fields like kubernetes.container.name and kubernetes.labels to determine the fields service.name and service.version. Read more about custom pipelines in Tutorial: Transform data with custom ingest pipelines | Fleet and Elastic Agent Guide [8.11].

The OpenShift Cluster Log Forwarder writes the data to the app-write and audit-write indices by default. It is possible to change this behavior, but it still prepends the prefix “app” and appends the suffix “write”, so we opted to send the data to the default destination and use the reroute processor to send it to the right data streams. Read more about the reroute processor in our blog Simplifying log data management: Harness the power of flexible routing with Elastic and our documentation Reroute processor | Elasticsearch Guide [8.11] | Elastic.

In this case, we want to redirect the container logs (app-write index) to logs-kubernetes.container_logs and the Audit logs (audit-write) to logs-kubernetes.audit_logs:

PUT _ingest/pipeline/app-write-reroute-pipeline
{
  "processors": [
    {
      "pipeline": {
        "name": "openshift-2-ecs",
        "description": "Format the Openshift data in ECS"
      }
    },
    {
      "set": {
        "field": "event.dataset",
        "value": "kubernetes.container_logs"
      }
    },
    {
      "reroute": {
        "destination": "logs-kubernetes.container_logs-openshift"
      }
    }
  ]
}



PUT _ingest/pipeline/audit-write-reroute-pipeline
{
  "processors": [
    {
      "pipeline": {
        "name": "openshift-audit-2-ecs",
        "description": "Format the Openshift data in ECS"
      }
    },
    {
      "set": {
        "field": "event.dataset",
        "value": "kubernetes.audit_logs"
      }
    },
    {
      "reroute": {
        "destination": "logs-kubernetes.audit_logs-openshift"
      }
    }
  ]
}

Please note that because app-write and audit-write do not follow the data stream naming convention, we must set the destination field explicitly in the reroute processor. The reroute processor will also fill in the data_stream fields for us. Note that this step is done automatically by Elastic Agent at the source.

Next, we create the indices with the default pipelines that reroute the logs according to our needs.

PUT app-write
{
  "settings": {
      "index.default_pipeline": "app-write-reroute-pipeline"
   }
}


PUT audit-write
{
  "settings": {
    "index.default_pipeline": "audit-write-reroute-pipeline"
  }
}
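As an optional sanity check, you can verify that the default pipelines are attached to the two indices:

GET app-write/_settings?filter_path=*.settings.index.default_pipeline
GET audit-write/_settings?filter_path=*.settings.index.default_pipeline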

Basically, what we did can be summarized in this picture:

openshift-summary-blog

Let us take the container logs. When the operator attempts to write to the app-write index, it invokes the default_pipeline “app-write-reroute-pipeline”, which formats the logs into ECS and reroutes them to the logs-kubernetes.container_logs-openshift data stream. This calls the integration pipeline, which in turn invokes, if it exists, the logs-kubernetes.container_logs@custom pipeline. Finally, the logs-kubernetes.container_logs pipeline may reroute the logs to another dataset and namespace using the elastic.co/dataset and elastic.co/namespace annotations as described in the Kubernetes integration documentation, which in turn can lead to the execution of another integration pipeline.
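For example, a team could route the logs of one of its workloads to a dedicated dataset and namespace by annotating the pod. The workload below is purely hypothetical, while the annotation keys are the ones documented by the Kubernetes integration:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    elastic.co/dataset: my-app.log
    elastic.co/namespace: production
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:latest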

Create a user for sending the logs

We are going to use basic authentication because, at the time of writing, it is the only supported authentication method for Elasticsearch in OpenShift logging. Thus, we need a role that allows the user to read and write the app-write and audit-write indices (required by the OpenShift log forwarder) and that grants auto_configure access to logs-*-* to allow custom Kubernetes rerouting:

PUT _security/role/YOURROLE
{
    "cluster": [
      "monitor"
    ],
    "indices": [
      {
        "names": [
          "logs-*-*"
        ],
        "privileges": [
          "auto_configure",
          "create_doc"
        ],
        "allow_restricted_indices": false
      },
      {
        "names": [
          "app-write",
          "audit-write",
        ],
        "privileges": [
          "create_doc",
          "read"
        ],
        "allow_restricted_indices": false
      }
    ],
    "applications": [],
    "run_as": [],
    "metadata": {},
    "transient_metadata": {
      "enabled": true
    }

}



PUT _security/user/YOUR_USERNAME
{
  "password": "YOUR_PASSWORD",
  "roles": ["YOURROLE"]
}
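Before moving over to OpenShift, it can be useful to verify the credentials from any host that can reach the deployment. The URL, username, and password below are placeholders to substitute with your own:

curl -u YOUR_USERNAME:YOUR_PASSWORD "https://YOUR_ELASTICSEARCH_URL:443/_security/_authenticate"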

On OpenShift

On the OpenShift cluster, we need to follow Red Hat's official documentation on how to install Red Hat OpenShift Logging and configure Cluster Logging and the Cluster Log Forwarder.

We need to install the Red Hat OpenShift Logging Operator, which defines the ClusterLogging and ClusterLogForwarder Resources. Afterward, we can define the Cluster Logging resource:

apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collection:
    logs:
      type: vector
      vector: {}

The ClusterLogForwarder is the resource responsible for defining a daemon set that forwards the logs to the remote Elasticsearch. Before creating it, we need to create a secret containing the Elasticsearch credentials of the user we created previously, in the same namespace where the ClusterLogForwarder will be deployed:

apiVersion: v1
kind: Secret
metadata:
  name: elasticsearch-password
  namespace: openshift-logging
type: Opaque
stringData:
  username: YOUR_USERNAME
  password: YOUR_PASSWORD

Finally, we create the ClusterLogForwarder resource:

kind: ClusterLogForwarder
apiVersion: logging.openshift.io/v1
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: remote-elasticsearch
      secret:
        name: elasticsearch-password
      type: elasticsearch
      url: "https://YOUR_ELASTICSEARCH_URL:443"
      elasticsearch:
        version: 8 # The default is version 6 with the _type field
  pipelines:
    - inputRefs:
        - application
        - audit
      name: enable-default-log-store
      outputRefs:
        - remote-elasticsearch

Note that we explicitly set the Elasticsearch version to 8; otherwise, the ClusterLogForwarder would send the _type field, which is not compatible with Elasticsearch 8. Also note that we collect only application and audit logs.
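Once the ClusterLogForwarder is created, the operator rolls out the collector pods. Assuming the oc CLI is logged into your cluster, a quick check that they are running is:

oc get pods -n openshift-logging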

Result

Once the logs are collected and passed through all the pipelines, the result is very close to the out-of-the-box Kubernetes integration. There are notable differences, such as host and cloud metadata that do not seem to be collected (at least without additional configuration). We can view the Kubernetes container logs in Logs Explorer:

openshift-summary-blog-graphs

In this post, we described how you can use the OpenShift Logging Operator to collect container and audit logs. We still recommend leveraging Elastic Agent to collect your logs where possible: it provides the best user experience, there is no need to maintain or transform the logs into ECS format yourself, and it uses API keys for authentication and collects metadata such as cloud information, which lets you do more in the long run.

Learn more about log monitoring with the Elastic Stack.

Have feedback on this blog? Share it here.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

]]>
<![CDATA[Optimizing Observability with ES|QL: Streamlining SRE operations and issue resolution for Kubernetes and OTel]]> https://www.elastic.co/observability-labs/blog/opentelemetry-kubernetes-esql opentelemetry-kubernetes-esql Wed, 01 Nov 2023 00:00:00 GMT As an operations engineer (SRE, IT Operations, DevOps), managing technology and data sprawl is an ongoing challenge. Simply managing the large volumes of high dimensionality and high cardinality data is overwhelming.

As a single platform, Elastic® helps SREs unify and correlate limitless telemetry data, including metrics, logs, traces, and profiling, into a single datastore — Elasticsearch®. By then applying the power of Elastic’s advanced machine learning (ML), AIOps, AI Assistant, and analytics, you can break down silos and turn data into insights. As a full-stack observability solution, everything from infrastructure monitoring to log monitoring and application performance monitoring (APM) can be found in a single, unified experience.

In Elastic 8.11, a technical preview is now available of Elastic’s new piped query language, ES|QL (Elasticsearch Query Language), which transforms, enriches, and simplifies data investigations. Powered by a new query engine, ES|QL delivers advanced search capabilities with concurrent processing, improving speed and efficiency, irrespective of data source and structure. Accelerate resolution by creating aggregations and visualizations from one screen, delivering an iterative, uninterrupted workflow.

Advantages of ES|QL for SREs

SREs using Elastic Observability can leverage ES|QL to analyze logs, metrics, traces, and profiling data, enabling them to pinpoint performance bottlenecks and system issues with a single query. SREs gain the following advantages when managing high dimensionality and high cardinality data with ES|QL in Elastic Observability:

  • Improved operational efficiency: By using ES|QL, SREs can create more actionable notifications with aggregated values as thresholds from a single query, which can also be managed through the Elastic API and integrated into DevOps processes.
  • Enhanced analysis with insights: ES|QL can process diverse observability data, including application, infrastructure, business data, and more, regardless of the source and structure. ES|QL can easily enrich the data with additional fields and context, allowing the creation of visualizations for dashboards or issue analysis with a single query.
  • Reduced mean time to resolution: ES|QL, when combined with Elastic Observability's AIOps and AI Assistant, enhances detection accuracy by identifying trends, isolating incidents, and reducing false positives. This improvement in context facilitates troubleshooting and the quick pinpointing and resolution of issues.

ES|QL in Elastic Observability not only enhances an SRE's ability to manage the customer experience, an organization's revenue, and SLOs more effectively but also facilitates collaboration with developers and DevOps by providing contextualized aggregated data.

In this blog, we will cover some of the key use cases SREs can leverage with ES|QL:

  • ES|QL integrated with the Elastic AI Assistant, which uses public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.
  • SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.
  • Actionable alerts can be easily created from a single ES|QL query, enhancing operations.

I will work through these use cases by showcasing how an SRE can solve a problem in an application instrumented with OpenTelemetry and running on Kubernetes. The OpenTelemetry (OTel) demo is on an Amazon EKS cluster, with Elastic Cloud 8.11 configured.

You can also check out our Elastic Observability ES|QL Demo, which walks through ES|QL functionality for Observability.

ES|QL with AI Assistant

As an SRE, you are monitoring your OTel instrumented application with Elastic Observability, and while in Elastic APM, you notice some issues highlighted in the service map.

1 - services

Using the Elastic AI Assistant, you can easily ask for analysis; in particular, we check what the overall latency is across the application services.

My APM data is in traces-apm*. What's the average latency per service over the last hour? Use ESQL, the data is mapped to ECS

The Elastic AI Assistant generates an ES|QL query, which we run in the AI Assistant to get a list of the average latencies across all the application services. We can easily see the top four are:

  • load generator
  • front-end proxy
  • frontendservice
  • checkoutservice

With a simple natural language query in the AI Assistant, we got a single generated ES|QL query that listed the latencies across the services.

Noticing that there is an issue with several services, we decide to start with the frontend proxy. As we work through the details, we see significant failures, and through Elastic APM failure correlation, it becomes apparent that the frontend proxy is not properly completing its calls to downstream services.

2 - failed transaction

ES|QL insightful and contextual analysis in Discover

Knowing that the application is running on Kubernetes, we investigate whether there are issues at the Kubernetes level. In particular, we want to see whether any services are having issues.

We use the following query in ES|QL in Elastic Discover:

from metrics-*
| where kubernetes.container.status.last_terminated_reason != "" and kubernetes.namespace == "default"
| stats reason_count = count(kubernetes.container.status.last_terminated_reason) by kubernetes.container.name, kubernetes.container.status.last_terminated_reason
| where reason_count > 0

3 - horizontal graph

ES|QL helps analyze thousands to tens of thousands of metric events from Kubernetes and highlights two services that are restarting due to OOMKilled.

The Elastic AI Assistant, when asked about OOMKilled, indicates that a container in a pod was killed due to an out-of-memory condition.

4 - understanding oomkilled

We run another ES|QL query to understand the memory usage for emailservice and productcatalogservice.
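The exact query is not shown here, but a sketch of what it could look like, reusing the Kubernetes metric fields from the other queries in this post, is:

FROM metrics-*
| WHERE @timestamp >= NOW() - 1 hours
  AND kubernetes.container.name IN ("emailservice", "productcatalogservice")
| STATS avg_memory_usage = AVG(kubernetes.pod.memory.usage.limit.pct) BY kubernetes.container.name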

5 - split bar graphs

ES|QL shows that the average memory usage is fairly high for both services.

We can now further investigate both of these services’ logs, metrics, and Kubernetes-related data. However, before we continue, we create an alert to track heavy memory usage.

Actionable alerts with ES|QL

Suspecting a specific issue that might recur, we create an alert based on the ES|QL query we just ran, tracking any service that exceeds 50% memory utilization.

We modify the last query to find any service with high memory usage:

FROM metrics*
| WHERE @timestamp >= NOW() - 1 hours
| STATS avg_memory_usage = AVG(kubernetes.pod.memory.usage.limit.pct) BY kubernetes.deployment.name | where avg_memory_usage > .5

With that query, we create a simple alert. Notice how the ES|QL query is brought into the alert. We connect this one to PagerDuty, but we can choose from multiple connectors such as ServiceNow, Opsgenie, email, and more.

6 - create rule

With this alert, we can now easily monitor for any services that exceed 50% memory utilization in their pods.

Make the most of your data with ES|QL

In this post, we demonstrated the power ES|QL brings to analysis, operations, and reducing MTTR. In summary, the three use cases with ES|QL in Elastic Observability are as follows:

  • ES|QL integrated with the Elastic AI Assistant, which uses public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.
  • SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.
  • Actionable alerts can be easily created from a single ES|QL query, enhancing operations.

Elastic invites SREs and developers to experience this transformative language firsthand and unlock new horizons in their data tasks. Try it today at https://ela.st/free-trial now in technical preview.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

]]>
<![CDATA[Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 1]]> https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1 pii-ner-regex-assess-redact-part-1 Wed, 25 Sep 2024 00:00:00 GMT Introduction:

The prevalence of high-entropy logs in distributed systems has significantly raised the risk of PII (Personally Identifiable Information) seeping into our logs, which can result in security and compliance issues. This 2-part blog delves into the crucial task of identifying and managing this issue using the Elastic Stack. We will explore using NLP (Natural Language Processing) and Pattern matching to detect, assess, and, where feasible, redact PII from logs that are being ingested into Elasticsearch.

In Part 1 of this blog, we will cover the following:

  • Review the techniques and tools we have available to manage PII in our logs
  • Understand the roles of NLP / NER in PII detection
  • Build a composable processing pipeline to detect and assess PII
  • Sample logs and run them through the NER Model
  • Assess the results of the NER Model

In Part 2 of this blog, we will cover the following:

  • Redact PII using NER and the redact processor
  • Apply field-level security to control access to the un-redacted data
  • Enhance the dashboards and alerts
  • Production considerations and scaling
  • How to run these processes on incoming or historical data

Here is the overall flow we will construct over the 2 blogs:

PII Overall Flow

All code for this exercise can be found at: https://github.com/bvader/elastic-pii.

Tools and Techniques

There are four general capabilities that we will use for this exercise.

  • Named Entity Recognition Detection (NER)
  • Pattern Matching Detection
  • Log Sampling
  • Ingest Pipelines as Composable Processing

Named Entity Recognition (NER) Detection

NER is a sub-task of Natural Language Processing (NLP) that involves identifying and categorizing named entities in unstructured text into predefined categories such as:

  • Person: Names of individuals, including celebrities, politicians, and historical figures.
  • Organization: Names of companies, institutions, and organizations.
  • Location: Geographic locations, including cities, countries, and landmarks.
  • Event: Names of events, including conferences, meetings, and festivals.

For our PII use case, we will choose the base BERT NER model, bert-base-NER, which can be downloaded from Hugging Face and loaded into Elasticsearch as a trained model.

Important Note: NER / NLP Models are CPU-intensive and expensive to run at scale; thus, we will want to employ a sampling technique to understand the risk in our logs without sending the full logs volume through the NER Model. We will discuss the performance and scaling of the NER model in part 2 of the blog.

Pattern Matching Detection

In addition to using an NER, regex pattern matching is a powerful tool for detecting and redacting PII based on common patterns. The Elasticsearch redact processor is built for this use case.

Log Sampling

Considering the performance implications of NER and the fact that we may be ingesting a large volume of logs into Elasticsearch, it makes sense to sample our incoming logs. We will build a simple log sampler to accomplish this.

Ingest Pipelines as Composable Processing

We will create several pipelines, each focusing on a specific capability and a main ingest pipeline to orchestrate the overall process.

Building the Processing Flow

Logs Sampling + Composable Ingest Pipelines

The first thing we will do is set up a sampler for our logs. This ingest pipeline simply takes a sampling rate between 0 (no logs) and 10000 (all logs), which allows sampling rates as low as ~0.01%, and marks the sampled logs with sample.sampled: true. Further processing of the logs is driven by the value of sample.sampled. The sample.sample_rate can be set here or "passed in" from the orchestration pipeline.

The commands should be run from Kibana -> Dev Tools.

The code can be found here for the following three sections of code.

<details open> <summary>logs-sampler pipeline code - click to open/close</summary>
# logs-sampler pipeline - part 1
DELETE _ingest/pipeline/logs-sampler
PUT _ingest/pipeline/logs-sampler
{
  "processors": [
    {
      "set": {
        "description": "Set Sampling Rate 0 None 10000 all allows for 0.01% precision",
        "if": "ctx.sample.sample_rate == null",
        "field": "sample.sample_rate",
        "value": 10000
      }
    },
    {
      "set": {
        "description": "Determine if keeping unsampled docs",
        "if": "ctx.sample.keep_unsampled == null",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "set": {
        "field": "sample.sampled",
        "value": false
      }
    },
    {
      "script": {
        "source": """ Random r = new Random();
        ctx.sample.random = r.nextInt(params.max); """,
        "params": {
          "max": 10000
        }
      }
    },
    {
      "set": {
        "if": "ctx.sample.random <= ctx.sample.sample_rate",
        "field": "sample.sampled",
        "value": true
      }
    },
    {
      "drop": {
         "description": "Drop unsampled document if applicable",
        "if": "ctx.sample.keep_unsampled == false && ctx.sample.sampled == false"
      }
    }
  ]
}
</details>

Now, let's test the logs sampler. We will build the first part of the composable pipeline. We will be sending logs to the logs-generic-default data stream. With that in mind, we will create the logs@custom ingest pipeline that will be automatically called using the logs data stream framework for customization. We will add one additional level of abstraction so that you can apply this PII processing to other data streams.

Next, we will create the process-pii pipeline. This is the core processing pipeline where we orchestrate the PII processing component pipelines. In this first step, we simply apply the sampling logic. Note that we are setting the sampling rate to 1000 (out of 10000), which is equivalent to 10% of the logs.

<details open> <summary>process-pii pipeline code - click to open/close</summary>
# Process PII pipeline - part 1
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  "processors": [
    {
      "set": {
        "description": "Set true if enabling sampling, otherwise false",
        "field": "sample.enabled",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set Sampling Rate 0 None 10000 all allows for 0.01% precision",
        "field": "sample.sample_rate",
        "value": 1000
      }
    },
    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing hostorical data",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == true",
        "name": "logs-sampler",
        "ignore_failure": true
      }
    }
  ]
}
</details>

Finally, we create the logs@custom pipeline, which will simply call our process-pii pipeline based on the correct data_stream.dataset.

<details open> <summary>logs@custom pipeline code - click to open/close</summary>
# logs@custom pipeline - part 1
DELETE _ingest/pipeline/logs@custom
PUT _ingest/pipeline/logs@custom
{
  "processors": [
    {
      "set": {
        "field": "pipelinetoplevel",
        "value": "logs@custom"
      }
    },
        {
      "set": {
        "field": "pipelinetoplevelinfo",
        "value": "{{{data_stream.dataset}}}"
      }
    },
    {
      "pipeline": {
        "description" : "Call the process_pii pipeline on the correct dataset",
        "if": "ctx?.data_stream?.dataset == 'pii'", 
        "name": "process-pii"
      }
    }
  ]
}
</details>

Now, let's test to see the sampling at work.

Load the data as described in the Data Loading Appendix. Let's use the sample data first; we will talk about how to test with your incoming or historical logs at the end of this blog.

If you look at Observability -> Logs -> Logs Explorer with the KQL filter data_stream.dataset : pii and Breakdown by sample.sampled, you should see that approximately 10% of the logs are marked as sampled.

PII Discover 1

At this point we have a composable ingest pipeline that is "sampling" logs. As a bonus, you can use this logs sampler for any other use cases you have as well.
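If you want to verify the wiring without loading data, you can also simulate the logs@custom pipeline directly. The documents below are hypothetical; with the sampling rate set to 1000 out of 10000, each document has roughly a 10% chance of coming back with sample.sampled: true:

POST _ingest/pipeline/logs@custom/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "test log line 1",
        "data_stream": {
          "dataset": "pii"
        }
      }
    },
    {
      "_source": {
        "message": "test log line 2",
        "data_stream": {
          "dataset": "pii"
        }
      }
    }
  ]
}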

Loading, Configuration, and Execution of the NER Pipeline

Loading the NER Model

You will need a Machine Learning node to run the NER model on. In this exercise, we are using Elastic Cloud Hosted Deployment on AWS with the CPU Optimized (ARM) architecture. The NER inference will run on a Machine Learning AWS c5d node. There will be GPU options in the future, but today, we will stick with CPU architecture.

This exercise will use a single c5d node with 8 GB of RAM and 4.2 vCPU, burstable up to 8.4 vCPU.

ML Node

Please refer to the official documentation on how to import an NLP-trained model into Elasticsearch for complete instructions on uploading, configuring, and deploying the model.

The quickest way to get the model is using the Eland Docker method.

The following command will load the model into Elasticsearch but will not start it. We will do that in the next step.

docker run -it --rm --network host docker.elastic.co/eland/eland \
  eland_import_hub_model \
  --url https://mydeployment.es.us-west-1.aws.found.io:443/ \
  -u elastic -p password \
  --hub-model-id dslim/bert-base-NER --task-type ner

Deploy and Start the NER Model

In general, to improve ingest performance, increase throughput by adding more allocations to the deployment. For improved search speed, increase the number of threads per allocation.

To scale ingest, we will focus on scaling the allocations for the deployed model. More information on this topic is available here. The number of allocations must be less than the available allocated processors (cores, not vCPUs) per node.

To deploy and start the NER model, we will use the Start trained model deployment API.

We will configure the following:

  • 4 Allocations to allow for more parallel ingestion
  • 1 Thread per Allocation
  • 0 bytes cache, as we expect a low cache hit rate
  • 8192 Queue
# Start the model with 4 Allocators x 1 Thread, no cache, and 8192 queue
POST _ml/trained_models/dslim__bert-base-ner/deployment/_start?cache_size=0b&number_of_allocations=4&threads_per_allocation=1&queue_capacity=8192

You should get a response that looks something like this.

{
  "assignment": {
    "task_parameters": {
      "model_id": "dslim__bert-base-ner",
      "deployment_id": "dslim__bert-base-ner",
      "model_bytes": 430974836,
      "threads_per_allocation": 1,
      "number_of_allocations": 4,
      "queue_capacity": 8192,
      "cache_size": "0",
      "priority": "normal",
      "per_deployment_memory_bytes": 430914596,
      "per_allocation_memory_bytes": 629366952
    },
...
    "assignment_state": "started",
    "start_time": "2024-09-23T21:39:18.476066615Z",
    "max_assigned_allocations": 4
  }
}

The NER model has been deployed and started and is ready to be used.
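You can check the deployment state, allocation count, and inference statistics at any time with the trained models stats API:

GET _ml/trained_models/dslim__bert-base-ner/_stats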

The following ingest pipeline implements the NER model via the inference processor.

There is a significant amount of code here, but only two items are of interest right now. The rest of the code is conditional logic to drive additional, specific behavior that we will look at more closely later.

  1. The inference processor calls the NER model by ID, which we loaded previously, and passes the text to be analyzed, which, in this case, is the message field, which is the text_field we want to pass to the NER model to analyze for PII.

  2. The script processor loops through the message field and uses the data generated by the NER model to replace the identified PII with redacted placeholders. This looks more complex than it really is, as it simply loops through the array of ML predictions and replaces them in the message string with constants, and stores the results in a new field redact.message. We will look at this a little closer in the following steps.

The code can be found here for the following three sections of code.

The NER PII Pipeline

<details open> <summary>logs-ner-pii-processor pipeline code - click to open/close</summary>
# NER Pipeline
DELETE _ingest/pipeline/logs-ner-pii-processor
PUT _ingest/pipeline/logs-ner-pii-processor
{
  "processors": [
    {
      "set": {
        "description": "Set to true to actually redact, false will run processors but leave original",
        "field": "redact.enable",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set to true to keep ml results for debugging",
        "field": "redact.ner.keep_result",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set to PER, LOC, ORG to skip, or NONE to not drop any replacement",
        "field": "redact.ner.skip_entity",
        "value": "NONE"
      }
    },
    {
      "set": {
        "description": "Set to PER, LOC, ORG to skip, or NONE to not drop any replacement",
        "field": "redact.ner.minimum_score",
        "value": 0
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.message == null",
        "field": "redact.message",
        "copy_from": "message"
      }
    },
    {
      "set": {
        "field": "redact.ner.successful",
        "value": true
      }
    },
    {
      "set": {
        "field": "redact.ner.found",
        "value": false
      }
    },
    {
      "inference": {
        "model_id": "dslim__bert-base-ner",
        "field_map": {
          "message": "text_field"
        },
        "on_failure": [
          {
            "set": {
              "description": "Set 'error.message'",
              "field": "failure",
              "value": "REDACT_NER_FAILED"
            }
          },
          {
            "set": {
              "field": "redact.ner.successful",
              "value": false
            }
          }
        ]
      }
    },
    {
      "script": {
        "if": "ctx.failure_ner != 'REDACT_NER_FAILED'",
        "lang": "painless",
        "source": """String msg = ctx['message'];
          for (item in ctx['ml']['inference']['entities']) {
          	if ((item['class_name'] != ctx.redact.ner.skip_entity) && 
          	  (item['class_probability'] >= ctx.redact.ner.minimum_score)) {  
          		  msg = msg.replace(item['entity'], '<' + 
          		  'REDACTNER-'+ item['class_name'] + '_NER>')
          	}
          }
          ctx.redact.message = msg""",
        "on_failure": [
          {
            "set": {
              "description": "Set 'error.message'",
              "field": "failure",
              "value": "REDACT_REPLACEMENT_SCRIPT_FAILED",
              "override": false
            }
          },
          {
            "set": {
              "field": "redact.successful",
              "value": false
            }
          }
        ]
      }
    },
    
    {
      "set": {
        "if": "ctx?.ml?.inference?.entities.size() > 0", 
        "field": "redact.ner.found",
        "value": true,
        "ignore_failure": true
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.pii?.found == null",
        "field": "redact.pii.found",
        "value": false
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.ner?.found == true",
        "field": "redact.pii.found",
        "value": true
      }
    },
    {
      "remove": {
        "if": "ctx.redact.ner.keep_result != true",
        "field": [
          "ml"
        ],
        "ignore_missing": true,
        "ignore_failure": true
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "failure",
        "value": "GENERAL_FAILURE",
        "override": false
      }
    }
  ]
}
</details>

The updated PII Processor Pipeline, which now calls the NER Pipeline

<details open> <summary>process-pii pipeline code - click to open/close</summary>
# Updated Process PII pipeline that now call the NER pipeline
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  "processors": [
    {
      "set": {
        "description": "Set true if enabling sampling, otherwise false",
        "field": "sample.enabled",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set Sampling Rate 0 None 10000 all allows for 0.01% precision",
        "field": "sample.sample_rate",
        "value": 1000
      }
    },
    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing hostorical data",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == true",
        "name": "logs-sampler",
        "ignore_failure": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true && ctx.sample.sampled == true)",
        "name": "logs-ner-pii-processor"
      }
    }
  ]
}

</details>

Now reload the data as described here in Reloading the logs

Results

Let's take a look at the results with the NER processing in place. In the Logs Explorer with KQL query bar, execute the following query data_stream.dataset : pii and ml.inference.entities.class_name : ("PER" and "LOC" and "ORG" )

Logs Explorer should look something like this, open the top message to see the details.

PII Discover 2

NER Model Results

Let's take a closer look at what these fields mean.

Field: ml.inference.entities.class_name
Sample Value: [PER, PER, LOC, ORG, ORG]
Description: An array of the named entity classes that the NER model has identified.

Field: ml.inference.entities.class_probability
Sample Value: [0.999, 0.972, 0.896, 0.506, 0.595]
Description: The class_probability is a value between 0 and 1 that indicates how likely it is that a given data point belongs to a certain class. The higher the number, the higher the probability that the data point belongs to the named class. This is important, as in the next blog we will decide on a threshold to use for alerting and redaction. You can see in this example that it identified a LOC as an ORG; we can filter these out or find them by setting a threshold.

Field: ml.inference.entities.entity
Sample Value: [Paul Buck, Steven Glens, South Amyborough, ME, Costco]
Description: The array of entities identified that align positionally with the class_name and class_probability.

Field: ml.inference.predicted_value
Sample Value: [2024-09-23T14:32:14.608207-07:00Z] log.level=INFO: Payment successful for order #4594 (user: [Paul Buck](PER&Paul+Buck), david59@burgess.net). Phone: 726-632-0527x520, Address: 3713 [Steven Glens](PER&Steven+Glens), [South Amyborough](LOC&South+Amyborough), [ME](ORG&ME) 93580, Ordered from: [Costco](ORG&Costco)
Description: The predicted value of the model.

PII Assessment Dashboard

Let's take a quick look at a dashboard built to assess the PII data.

To load the dashboard, go to Kibana -> Stack Management -> Saved Objects and import the pii-dashboard-part-1.ndjson file that can be found here:

https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson

More complete instructions on Kibana Saved Objects can be found here.

After loading the dashboard, navigate to it and select the right time range; you should see something like the screenshot below. It shows metrics such as the sample rate, the percentage of logs with NER detections, and NER score trends. We will examine the assessment and actions in Part 2 of this blog.

PII Dashboard 1

Summary and Next Steps

In this first part of the blog, we have accomplished the following.

  • Reviewed the techniques and tools we have available for PII detection and assessment
  • Reviewed NLP / NER role in PII detection and assessment
  • Built the necessary composable ingest pipelines to sample logs and run them through the NER Model
  • Reviewed the NER results and are ready to move to the second blog

In the upcoming Part 2 of this blog, we will cover the following:

  • Redact PII using NER and redact processor
  • Apply field-level security to control access to the un-redacted data
  • Enhance the dashboards and alerts
  • Production considerations and scaling
  • How to run these processes on incoming or historical data

Data Loading Appendix

Code

The data loading code can be found here:

https://github.com/bvader/elastic-pii

$ git clone https://github.com/bvader/elastic-pii.git

Creating and Loading the Sample Data Set

$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker

Run the log generator

$ python generate_random_logs.py

If you do not change any parameters, this will create 10,000 random logs in a file named pii.log, with a mix of logs that do and do not contain PII.

Edit load_logs.py and set the following

# The Elastic User 
ELASTIC_USER = "elastic"

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = "askdjfhasldfkjhasdf"

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = "deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ="

Then run the following command.

$ python load_logs.py

Reloading the logs

Note: To reload the logs, simply re-run the above command. You can run the command multiple times during this exercise, and the logs will be reloaded (actually, loaded again). The new logs will not collide with previous runs, as each run gets a unique run.id, which is displayed at the end of the loading process.

$ python load_logs.py
]]>
<![CDATA[Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 2]]> https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-2 pii-ner-regex-assess-redact-part-2 Tue, 22 Oct 2024 00:00:00 GMT Introduction:

The prevalence of high-entropy logs in distributed systems has significantly raised the risk of PII (Personally Identifiable Information) seeping into our logs, which can result in security and compliance issues. This 2-part blog delves into the crucial task of identifying and managing this issue using the Elastic Stack. We will explore using NLP (Natural Language Processing) and Pattern matching to detect, assess, and, where feasible, redact PII from logs being ingested into Elasticsearch.

In Part 1 of this blog, we covered the following:

  • Review the techniques and tools we have available to manage PII in our logs
  • Understand the roles of NLP / NER in PII detection
  • Build a composable processing pipeline to detect and assess PII
  • Sample logs and run them through the NER Model
  • Assess the results of the NER Model

In Part 2 of this blog, we will cover the following:

  • Apply the redact regex pattern processor and assess the results
  • Create Alerts using ESQL
  • Apply field-level security to control access to the un-redacted data
  • Production considerations and scaling
  • How to run these processes on incoming or historical data

Reminder of the overall flow we will construct over the 2 blogs:

PII Overall Flow

All code for this exercise can be found at: https://github.com/bvader/elastic-pii.

Part 1 Prerequisites

This blog picks up where Part 1 of this blog left off. You must have the NER model, ingest pipelines, and dashboard from Part 1 installed and working.

  • Loaded and configured NER Model
  • Installed all the composable ingest pipelines from Part 1 of the blog
  • Installed dashboard

You can access the complete solution for Blog 1 here. Don't forget to load the dashboard, found here.

Applying the Redact Processor

Next, we will apply the redact processor. The redact processor is a simple regex-based processor that takes a list of regex patterns and looks for them in a field and replaces them with literals when found. The redact processor is reasonably performant and can run at scale. At the end, we will discuss this in detail in the production scaling section.

Elasticsearch comes packaged with a number of useful predefined patterns that can be conveniently referenced by the redact processor. If one does not suit your needs, create a new pattern with a custom definition. The Redact processor replaces every occurrence of a match. If there are multiple matches, they will all be replaced with the pattern name.

In the code below, we leveraged some of the predefined patterns as well as constructing several custom patterns.

        "patterns": [
          "%{EMAILADDRESS:EMAIL_REGEX}",      << Predefined
          "%{IP:IP_ADDRESS_REGEX}",           << Predefined
          "%{CREDIT_CARD:CREDIT_CARD_REGEX}", << Custom
          "%{SSN:SSN_REGEX}",                 << Custom
          "%{PHONE:PHONE_REGEX}"              << Custom
        ]

We also replaced the PII with easily identifiable patterns we can use for assessment.

In addition, it is important to note that since the redact processor is a simple regex find and replace, it can be used against many "secrets" patterns, not just PII. There are many references for regex and secrets patterns, so you can reuse this capability to detect secrets in your logs.
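As a small, standalone illustration of that point (not part of the main flow), the same processor can target a well-known secret format. The pipeline name and the AWS access key ID pattern below are only an example:

PUT _ingest/pipeline/redact-secrets-example
{
  "processors": [
    {
      "redact": {
        "field": "message",
        "patterns": [
          "%{AWS_ACCESS_KEY:AWS_ACCESS_KEY_REGEX}"
        ],
        "pattern_definitions": {
          "AWS_ACCESS_KEY": "AKIA[0-9A-Z]{16}"
        }
      }
    }
  ]
}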

The code can be found here for the following two sections of code.

<details open> <summary>redact processor pipeline code - click to open/close</summary>
# Add the PII redact processor pipeline
DELETE _ingest/pipeline/logs-pii-redact-processor
PUT _ingest/pipeline/logs-pii-redact-processor
{
  "processors": [
    {
      "set": {
        "field": "redact.proc.successful",
        "value": true
      }
    },
    {
      "set": {
        "field": "redact.proc.found",
        "value": false
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.message == null",
        "field": "redact.message",
        "copy_from": "message"
      }
    },
    {
      "redact": {
        "field": "redact.message",
        "prefix": "<REDACTPROC-",
        "suffix": ">",
        "patterns": [
          "%{EMAILADDRESS:EMAIL_REGEX}",
          "%{IP:IP_ADDRESS_REGEX}",
          "%{CREDIT_CARD:CREDIT_CARD_REGEX}",
          "%{SSN:SSN_REGEX}",
          "%{PHONE:PHONE_REGEX}"
        ],
        "pattern_definitions": {
          "CREDIT_CARD": """\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}""",
          "SSN": """\d{3}-\d{2}-\d{4}""",
          "PHONE": """(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"""
        },
        "on_failure": [
          {
            "set": {
              "description": "Set 'error.message'",
              "field": "failure",
              "value": "REDACT_PROCESSOR_FAILED",
              "override": false
            }
          },
          {
            "set": {
              "field": "redact.proc.successful",
              "value": false
            }
          }
        ]
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.message.contains('REDACTPROC')",
        "field": "redact.proc.found",
        "value": true
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.pii?.found == null",
        "field": "redact.pii.found",
        "value": false
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.proc?.found == true",
        "field": "redact.pii.found",
        "value": true
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "failure",
        "value": "GENERAL_FAILURE",
        "override": false
      }
    }
  ]
}
</details>
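A quick way to see the patterns in action is to simulate the pipeline with a hypothetical log line containing an email address and a phone number; both should come back replaced in redact.message with the <REDACTPROC-...> placeholders:

POST _ingest/pipeline/logs-pii-redact-processor/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "Payment failed for jane.doe@example.com, callback 555-123-4567"
      }
    }
  ]
}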

And now, we will add the logs-pii-redact-processor pipeline to the overall process-pii pipeline.

<details open> <summary>process-pii pipeline code - click to open/close</summary>
# Updated Process PII pipeline that now call the NER and Redact Processor pipeline
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  "processors": [
    {
      "set": {
        "description": "Set true if enabling sampling, otherwise false",
        "field": "sample.enabled",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set Sampling Rate 0 None 10000 all allows for 0.01% precision",
        "field": "sample.sample_rate",
        "value": 1000
      }
    },
    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing hostorical data",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == true",
        "name": "logs-sampler",
        "ignore_failure": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true && ctx.sample.sampled == true)",
        "name": "logs-ner-pii-processor"
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true &&  ctx.sample.sampled == true)",
        "name": "logs-pii-redact-processor"
      }
    }
  ]
}
</details>

Reload the data as described in Reloading the logs. If you did not generate the logs the first time, follow the instructions in the Data Loading Appendix.

Go to Discover and enter the following into the KQL bar: sample.sampled : true and redact.message: REDACTPROC. Add the redact.message field to the table, and you should see something like this.

PII Discover Blog 2 Part 1

If you did not already load the dashboard from Part 1 of this blog, load it now; it can be found here and imported via Kibana -> Stack Management -> Saved Objects -> Import.

It should look something like this now. Note that the REGEX portions of the dashboard are now active.

PII Dashboards Blog 2 Part 1

Checkpoint

At this point, we have the following capabilities:

  • Ability to sample incoming logs and apply this PII redaction
  • Detect and Assess PII with the NER/NLP and Pattern Matching
  • Assess the amount, type and quality of the PII detections

This is a great point to stop if you are just running all this once to see how it works, but we have a few more steps to make this useful in production systems.

  • Clean up the working and unredacted data
  • Update the Dashboard to work with the cleaned-up data
  • Apply Role Based Access Control to protect the raw unredacted data
  • Create Alerts
  • Production and Scaling Considerations
  • How to run these processes on incoming or historical data

Applying to Production Systems

Cleanup working data and update the dashboard

And now we will add the cleanup code to the overall process-pii pipeline.

In short, we set a flag redact.enable: true that directs the pipeline to move the unredacted message field to raw.message and to move the redacted message field redact.message to the message field. We will "protect" raw.message in the following section.

NOTE: Of course you can change this behavior if you want to completely delete the unredacted data. In this exercise we will keep it and protect it.

In addition we set redact.cleanup: true to clean up the NLP working data.

These fields allow a lot of control over what data you decide to keep and analyze.

The code can be found here for the following two sections of code.

<details open> <summary>process-pii pipeline code with cleanup - click to open/close</summary>
# Updated Process PII pipeline that now call the NER and Redact Processor pipeline and cleans up 
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  "processors": [
    {
      "set": {
        "description": "Set true if enabling sampling, otherwise false",
        "field": "sample.enabled",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set Sampling Rate 0 None 10000 all allows for 0.01% precision",
        "field": "sample.sample_rate",
        "value": 1000
      }
    },
    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing hostorical data",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == true",
        "name": "logs-sampler",
        "ignore_failure": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true && ctx.sample.sampled == true)",
        "name": "logs-ner-pii-processor"
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true &&  ctx.sample.sampled == true)",
        "name": "logs-pii-redact-processor"
      }
    },
    {
      "set": {
        "description": "Set to true to actually redact, false will run processors but leave original",
        "field": "redact.enable",
        "value": true
      }
    },
    {
      "rename": {
        "if": "ctx?.redact?.pii?.found == true && ctx?.redact?.enable == true",
        "field": "message",
        "target_field": "raw.message"
      }
    },
    {
      "rename": {
        "if": "ctx?.redact?.pii?.found == true && ctx?.redact?.enable == true",
        "field": "redact.message",
        "target_field": "message"
      }
    },
    {
      "set": {
        "description": "Set to true to actually to clean up working data",
        "field": "redact.cleanup",
        "value": true
      }
    },
    {
      "remove": {
        "if": "ctx?.redact?.cleanup == true",
        "field": [
          "ml"
        ],
        "ignore_failure": true
      }
    }
  ]
}
</details>

Reload the data as described in Reloading the logs.

Go to Discover and enter the following into the KQL bar: sample.sampled : true and redact.pii.found: true. Then add the following fields to the table:

message,raw.message,redact.ner.found,redact.proc.found,redact.pii.found

You should see something like this:

PII Discover Part 2 Blog 2

We have everything we need to move forward with protecting the PII and Alerting on it.

Load up the new dashboard that works on the cleaned-up data

To load the dashboard, go to Kibana -> Stack Management -> Saved Objects and import the pii-dashboard-part-2.ndjson file that can be found here.

The new dashboard should look like this. Note: It uses different fields under the covers since we have cleaned up the underlying data.

You should see something like this:

PII Dashboard Part 2 Blog 2

Apply Role Based Access Control to protect the raw unredacted data

Elasticsearch natively supports role-based access control, including field- and document-level access control, which dramatically reduces the operational and maintenance complexity required to secure our application.

We will create a Role that does not allow access to the raw.message field and then create a user and assign that user the role. With that role, the user will only be able to see the redacted message, which is now in the message field, but will not be able to access the protected raw.message field.

NOTE: Since we only sampled 10% of the data in this exercise, the message fields of non-sampled documents are not moved to raw.message and are still viewable, but this shows the capability you can apply in a production system.

The code can be found here for the following section of code.

<details open> <summary>RBAC protect-pii role and user code - click to open/close</summary>
# Create role with no access to the raw.message field
GET _security/role/protect-pii
DELETE _security/role/protect-pii
PUT _security/role/protect-pii
{
 "cluster": [],
 "indices": [
   {
     "names": [
       "logs-*"
     ],
     "privileges": [
       "read",
       "view_index_metadata"
     ],
     "field_security": {
       "grant": [
         "*"
       ],
       "except": [
         "raw.message"
       ]
     },
     "allow_restricted_indices": false
   }
 ],
 "applications": [
   {
     "application": "kibana-.kibana",
     "privileges": [
       "all"
     ],
     "resources": [
       "*"
     ]
   }
 ],
 "run_as": [],
 "metadata": {},
 "transient_metadata": {
   "enabled": true
 }
}

# Create user stephen with protect-pii role
GET _security/user/stephen
DELETE /_security/user/stephen
POST /_security/user/stephen
{
 "password" : "mypassword",
 "roles" : [ "protect-pii" ],
 "full_name" : "Stephen Brown"
}

</details>

Now log in from a separate window as the new user stephen, who has the protect-pii role. Go to Discover, put redact.pii.found : true in the KQL bar, and add the message field to the table. Notice that the raw.message field is not available.

You should see something like this PII Dashboard Part 2 Blog 2
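
If you would rather verify the field-level security from the API instead of Kibana, a minimal Python sketch is shown below. It assumes the stephen user and password created above and your own Cloud ID; it is an illustration, not part of the original solution code.

# Sketch: confirm the protect-pii role hides raw.message, assuming the
# 'stephen' user created above and your own Elasticsearch Cloud ID.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    cloud_id="<your cloud id>",              # placeholder
    basic_auth=("stephen", "mypassword"),
)

resp = es.search(
    index="logs-pii-default",
    query={"term": {"redact.pii.found": True}},
    size=1,
)

doc = resp["hits"]["hits"][0]["_source"]
print("message present:    ", "message" in doc)  # expected: True
print("raw.message present:", "raw" in doc)      # expected: False with this role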

Create an Alert when PII Detected

Now, with the pipeline processing in place, creating an alert when PII is detected is easy. Review Alerting in Kibana in detail if needed.

NOTE: Reload the data if needed to have recent data.

First, we will create a simple ES|QL query in Discover.

The code can be found here.

FROM logs-pii-default
| WHERE redact.pii.found == true
| STATS pii_count = count(*)
| WHERE pii_count > 0

When you run this you should see something like this.

PII ESQL Part 1 Blog 2
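
If you want to sanity-check the same ES|QL query outside of Discover, a quick sketch with the Python client is shown below; it assumes an 8.x deployment and the same credentials used in the Data Loading Appendix.

# Sketch: run the same ES|QL query from Python, assuming an 8.x deployment and
# the credentials used earlier in the Data Loading Appendix.
from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id="<your cloud id>", basic_auth=("elastic", "<password>"))

resp = es.esql.query(query="""
    FROM logs-pii-default
    | WHERE redact.pii.found == true
    | STATS pii_count = count(*)
    | WHERE pii_count > 0
""")
print(resp["columns"])  # e.g. [{'name': 'pii_count', 'type': 'long'}]
print(resp["values"])   # e.g. [[343]] when PII has been found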

Now click the Alerts menu and select Create search threshold rule; we will create a rule that alerts us when PII is found.

Select a time field: @timestamp
Set the time window: 5 minutes

Assuming you loaded the data recently, when you run Test you should see something like:

pii_count : 343 Alerts generated query matched

Add an action when the alert is Active.

For each alert: On status changes
Run when: Query matched

Elasticsearch query rule {{rule.name}} is active:

- PII Found: true
- PII Count: {{#context.hits}} {{_source.pii_count}}{{/context.hits}}
- Conditions Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}

Add an Action for when the Alert is Recovered.

For each alert: On status changes
Run when: Recovered

Elasticsearch query rule {{rule.name}} is Recovered:

- PII Found: false
- Conditions Not Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}

When everything is set up, it should look like this. Then click Save.

Alert Setup
Action Alert
Action Alert

You should get an Active alert that looks like this if you have recent data. I sent mine to Slack.

Elasticsearch query rule pii-found-esql is active:
- PII Found: true
- PII Count:  374
- Conditions Met: Query matched documents over 5m
- Timestamp: 2024-10-15T02:44:52.795Z
- Link: https://mydeployment123.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989

And then if you wait you will get a Recovered alert that looks like this.

Elasticsearch query rule pii-found-esql is Recovered:
- PII Found: false
- Conditions Not Met: Query did NOT match documents over 5m
- Timestamp: 2024-10-15T02:49:04.815Z
- Link: https://mydeployment123.kb.us-west-1.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989

Production Scaling

NER Scaling

As we mentioned in Part 1 of this blog, NER / NLP models are CPU-intensive and expensive to run at scale; thus, we employed a sampling technique to understand the risk in our logs without sending the full log volume through the NER model.

Please review the setup and configuration of the NER model from Part 1 of the blog.

We chose the base BERT NER model bert-base-NER for our PII case.

To scale ingest, we will focus on scaling the allocations for the deployed model. More information on this topic is available here. The number of allocations must be less than the available allocated processors (cores, not vCPUs) per node.

The metrics below are related to the model and configuration from Part 1 of the blog.

  • 4 Allocations to allow for more parallel ingestion
  • 1 Thread per Allocation
  • 0 Bytes Cache, as we expect a low cache hit rate. Note: If there are many repeated logs, a cache can help, but with timestamps and other variations, the cache will not help and can even slow down the process
  • 8192 Queue
GET _ml/trained_models/dslim__bert-base-ner/_stats
.....
           "node": {
              "0m4tq7tMRC2H5p5eeZoQig": {
.....
                "attributes": {
                  "xpack.installed": "true",
                  "region": "us-west-1",
                  "ml.allocated_processors": "5", << HERE 
.....
            },
            "inference_count": 5040,
            "average_inference_time_ms": 138.44285714285715, << HERE 
            "average_inference_time_ms_excluding_cache_hits": 138.44285714285715,
            "inference_cache_hit_count": 0,
.....
            "threads_per_allocation": 1,
            "number_of_allocations": 4,  <<< HERE
            "peak_throughput_per_minute": 1550,
            "throughput_last_minute": 1373,
            "average_inference_time_ms_last_minute": 137.55280407865988,
            "inference_cache_hit_count_last_minute": 0
          }
        ]
      }
    }

There are 3 key pieces of information above:

  • "ml.allocated_processors": "5" The number of physical cores / processors available

  • "number_of_allocations": 4 The number of allocations which is maximum 1 per physical core. Note: we could have used 5 allocations, but we only allocated 4 for this exercise

  • "average_inference_time_ms": 138.44285714285715 The averages inference time per document.

The throughput math for Inferences per Minute (IPM) per allocation (1 allocation per physical core) is pretty straightforward, since an inference uses a single core and a single thread.

Then the Inferences per Min per Allocation is simply:

IPM per allocation = 60,000 ms (in a minute) / 138ms per inference = 435

Which then lines up with the Total Inferences per Minute:

Total IPM = 435 IPM / allocation * 4 Allocations = ~1740

Suppose we want to do 10,000 IPMs, how many allocations (cores) would I need?

Allocations = 10,000 IPM / 435 IPM per allocation = 23 Allocations (cores, rounded up)

Or perhaps logs are coming in at 5000 EPS and you want to do 1% Sampling.

IPM = 5000 EPS * 60sec * 0.01 sampling = 3000 IPM sampled

Then

Number of Allocations = 3000 IPM / 435 IPM per allocation = 7 allocations (cores, rounded up)

Want to go faster? It turns out there is a more lightweight NER model, distilbert-NER, that is faster, but the tradeoff is slightly less accuracy.

Running the logs through this model results in an inference time nearly twice as fast!

"average_inference_time_ms": 66.0263959390863

Here is some quick math: IPM per allocation = 60,000 ms (in a minute) / 66ms per inference = 909

Suppose we want to do 25,000 IPMs, how many allocations (cores) would I need?

Allocations = 25,000 IPM / 909 IPM per allocation = 28 Allocations (cores, rounded up)

Now you can apply this math to determine the correct sampling and NER scaling to support your logging use case.
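
If you want to play with these numbers for your own log volumes, here is a minimal Python sketch of the same arithmetic; the inference times are illustrative values taken from the runs above, not a guarantee of what your hardware will produce.

# Sketch of the allocation sizing math above; the numbers are illustrative only.
import math

def ipm_per_allocation(avg_inference_time_ms: float) -> float:
    # Inferences per minute a single allocation (one core, one thread) can sustain
    return 60_000 / avg_inference_time_ms

def allocations_needed(events_per_second: float, sampling_rate: float,
                       avg_inference_time_ms: float) -> int:
    # Cores (allocations) required to keep up with the sampled log volume
    sampled_ipm = events_per_second * 60 * sampling_rate
    return math.ceil(sampled_ipm / ipm_per_allocation(avg_inference_time_ms))

# 5,000 EPS with 1% sampling against bert-base-NER (~138 ms per inference)
print(allocations_needed(5000, 0.01, 138))   # -> 7
# The same volume against the faster distilbert-NER (~66 ms per inference)
print(allocations_needed(5000, 0.01, 66))    # -> 4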

Redact Processor Scaling

In short, the redact processor should scale to production loads as long as you are using appropriately sized and configured nodes and have well-constructed regex patterns.

Assessing incoming logs

If you want to test on incoming log data in a data stream, all you need to do is change the conditional in the logs@custom pipeline to apply the process-pii pipeline to the dataset you want. You can use any conditional that fits your needs.

Note: Just make sure that you have accounted for the proper scaling for the NER and redact processors, as described above in Production Scaling.

    {
      "pipeline": {
        "description" : "Call the process_pii pipeline on the correct dataset",
        "if": "ctx?.data_stream?.dataset == 'pii'", <<< HERE
        "name": "process-pii"
      }
    }

So if, for example, your logs are coming into logs-mycustomapp-default, you would just change the conditional to:

        "if": "ctx?.data_stream?.dataset == 'mycustomapp'",

Assessing historical data

If you have a historical (already ingested) data stream or index, you can run the assessment over it using the _reindex API.

Note: Just make sure that you have accounted for the proper scaling for the NER and redact processors, as described above in Production Scaling.

There are a few extra steps. The code can be found here.

  1. First, we set the parameters to keep ONLY the sampled data, as there is no reason to make a copy of all the unsampled data. In the process-pii pipeline, there is a setting, sample.keep_unsampled, which we can set to false so that only the sampled data is kept.
    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing hostorical data",
        "field": "sample.keep_unsampled",
        "value": false <<< SET TO false
      }
    },
  2. Second, we create a pipeline that reroutes the data to the correct data stream so it runs through all the PII assessment/detection pipelines. It also sets the correct dataset and namespace.
DELETE _ingest/pipeline/sendtopii
PUT _ingest/pipeline/sendtopii
{
  "processors": [
    {
      "set": {
        "field": "data_stream.dataset",
        "value": "pii"
      }
    },
    {
      "set": {
        "field": "data_stream.namespace",
        "value": "default"
      }
    },
    {
      "reroute" : 
      {
        "dataset" : "{{data_stream.dataset}}",
        "namespace": "{{data_stream.namespace}}"
      }
    }
  ]
}
  3. Finally, we run a _reindex to select the data we want to test/assess. It is recommended to review the _reindex documentation before trying this. First, select the source data stream you want to assess; in this example, it is the logs-generic-default data stream. Note: I also added a range filter to select a specific time range. There is a bit of a "trick" we need to use since we are rerouting the data to the logs-pii-default data stream: we set "index": "logs-tmp-default" in the _reindex request, and the correct data stream is set in the pipeline. We must do this because reroute is a no-op if it is called from/to the same data stream.
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "logs-generic-default",
    "query": {
      "bool": {
        "filter": [
          {
            "range": {
              "@timestamp": {
                "gte": "now-1h/h",
                "lt": "now"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "op_type": "create",
    "index": "logs-tmp-default",
    "pipeline": "sendtopii"
  }
}
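
Because we passed wait_for_completion=false, the call returns a task ID rather than blocking. Below is a small sketch of kicking off the same reindex from Python and polling the task until it completes; it assumes the same cluster credentials used elsewhere in this post.

# Sketch: start the reindex from Python and poll the task until it completes,
# assuming the same cluster credentials used elsewhere in this post.
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id="<your cloud id>", basic_auth=("elastic", "<password>"))

task_id = es.reindex(
    source={
        "index": "logs-generic-default",
        "query": {"range": {"@timestamp": {"gte": "now-1h/h", "lt": "now"}}},
    },
    dest={"op_type": "create", "index": "logs-tmp-default", "pipeline": "sendtopii"},
    wait_for_completion=False,
)["task"]

while True:
    status = es.tasks.get(task_id=task_id)
    if status["completed"]:
        break
    print(status["task"]["status"])   # created / updated / total counters
    time.sleep(5)
print("reindex finished")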

Summary

At this point, you have the tools and processes needed to assess, detect, analyze, alert on, and protect PII in your logs.

The end-state solution can be found here.

In Part 1 of this blog, we accomplished the following:

  • Reviewed the techniques and tools we have available for PII detection and assessment
  • Reviewed NLP / NER role in PII detection and assessment
  • Built the necessary composable ingest pipelines to sample logs and run them through the NER Model
  • Reviewed the NER results and are ready to move to the second blog

In Part 2 of this blog, we covered the following:

  • Redact PII using NER and redact processor
  • Apply field-level security to control access to the un-redacted data
  • Enhance the dashboards and alerts
  • Production considerations and scaling
  • How to run these processes on incoming or historical data

So get to work and reduce risk in your logs!

Data Loading Appendix

Code

The data loading code can be found here:

https://github.com/bvader/elastic-pii

$ git clone https://github.com/bvader/elastic-pii.git

Creating and Loading the Sample Data Set

$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker

Run the log generator

$ python generate_random_logs.py

If you do not change any parameters, this will create 10,000 random logs in a file named pii.log with a mix of logs that do and do not contain PII.

Edit load_logs.py and set the following

# The Elastic User 
ELASTIC_USER = "elastic"

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = "askdjfhasldfkjhasdf"

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = "deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ="

Then run the following command.

$ python load_logs.py

Reloading the logs

Note: To reload the logs, you can simply re-run the above command. You can run the command multiple times during this exercise and the logs will be reloaded (actually loaded again). The new logs will not collide with previous runs, as there is a unique run.id for each run, which is displayed at the end of the loading process.

$ python load_logs.py
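
If you want to confirm how many runs have been loaded, a small sketch using a terms aggregation on the run.id field mentioned above is shown below; it assumes run.id is indexed as a keyword field and uses the same credentials configured in load_logs.py.

# Sketch: count loaded documents per run, assuming run.id is indexed as a
# keyword field and using the same credentials configured in load_logs.py.
from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id="<your cloud id>", basic_auth=("elastic", "<password>"))

resp = es.search(
    index="logs-pii-default",
    size=0,
    aggs={"runs": {"terms": {"field": "run.id", "size": 100}}},
)
for bucket in resp["aggregations"]["runs"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])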
]]>
<![CDATA[Pruning incoming log volumes with Elastic]]> https://www.elastic.co/observability-labs/blog/pruning-incoming-log-volumes pruning-incoming-log-volumes Fri, 23 Jun 2023 00:00:00 GMT
filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/log/*.log
filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/tmp/other.log
      - /var/log/*.log
processors:
  - drop_event:
      when:
        and:
          - equals:
              url.scheme: http
          - equals:
              url.path: /profile
filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/tmp/other.log
      - /var/log/*.log
processors:
  - drop_fields:
      when:
        and:
          - equals:
              url.scheme: http
          - equals:
              http.response.status_code: 200
      fields: ["event.message"]
      ignore_missing: false
input {
  file {
    id => "my-logging-app"
    path => [ "/var/tmp/other.log", "/var/log/*.log" ]
  }
}
filter {
  if [url.scheme] == "http" and [url.path] == "/profile" {
    drop {
      percentage => 80
    }
  }
}
output {
  elasticsearch {
        hosts => "https://my-elasticsearch:9200"
        data_stream => "true"
    }
}
# Input configuration omitted
filter {
  if [url.scheme] == "http" and [http.response.status_code] == 200 {
    drop {
      percentage => 80
    }
    mutate {
      remove_field => [ "event.message" ]
    }
  }
}
# Output configuration omitted
PUT _ingest/pipeline/my-logging-app-pipeline
{
  "description": "Event and field dropping for my-logging-app",
  "processors": [
    {
      "drop": {
        "description" : "Drop event",
        "if": "ctx?.url?.scheme == 'http' && ctx?.url?.path == '/profile'",
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "description" : "Drop field",
        "field" : "event.message",
        "if": "ctx?.url?.scheme == 'http' && ctx?.http?.response?.status_code == 200",
        "ignore_failure": false
      }
    }
  ]
}
PUT _ingest/pipeline/my-logging-app-pipeline
{
  "description": "Event and field dropping for my-logging-app with failures",
  "processors": [
    {
      "drop": {
        "description" : "Drop event",
        "if": "ctx?.url?.scheme == 'http' && ctx?.url?.path == '/profile'",
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "description" : "Drop field",
        "field" : "event.message",
        "if": "ctx?.url?.scheme == 'http' && ctx?.http?.response?.status_code == 200",
        "ignore_failure": false
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "description": "Set 'ingest.failure.message'",
        "field": "ingest.failure.message",
        "value": "Ingestion issue"
        }
      }
  ]
}
receivers:
  filelog:
    include: [/var/tmp/other.log, /var/log/*.log]
processors:
  filter/denylist:
    error_mode: ignore
    logs:
      log_record:
        - 'url.scheme == "info"'
        - 'url.path == "/profile"'
        - "http.response.status_code == 200"
  attributes/errors:
    actions:
      - key: error.message
        action: delete
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
  batch:
exporters:
  # Exporters configuration omitted
service:
  pipelines:
    # Pipelines configuration omitted
]]>
<![CDATA[Root cause analysis with logs: Elastic Observability's anomaly detection and log categorization]]> https://www.elastic.co/observability-labs/blog/reduce-mttd-ml-machine-learning-observability reduce-mttd-ml-machine-learning-observability Tue, 07 Feb 2023 00:00:00 GMT With more and more applications moving to the cloud, an increasing amount of telemetry data (logs, metrics, traces) is being collected, which can help improve application performance, operational efficiencies, and business KPIs. However, analyzing this data is extremely tedious and time consuming given the tremendous amounts of data being generated. Traditional methods of alerting and simple pattern matching (visual or simple searching etc) are not sufficient for IT Operations teams and SREs. It’s like trying to find a needle in a haystack.

In this blog post, we’ll cover some of Elastic’s artificial intelligence for IT operations (AIOps) and machine learning (ML) capabilities for root cause analysis.

Elastic’s machine learning will help you investigate performance issues by providing anomaly detection and pinpointing potential root causes through time series analysis and log outlier detection. These capabilities will help you reduce time in finding that “needle” in the haystack.

Elastic’s platform enables you to get started on machine learning quickly. You don’t need to have a data science team or design a system architecture. Additionally, there’s no need to move data to a third-party framework for model training.

Preconfigured machine learning models for observability and security are available. If those don't work well enough on your data, in-tool wizards guide you through the few steps needed to configure custom anomaly detection and train your model with supervised learning. To help get you started, there are several key features built into Elastic Observability to aid in analysis, helping bypass the need to run specific ML models. These features help minimize the time and analysis for logs.

Let’s review some of these built-in ML features:

Anomaly detection: Elastic Observability, when turned on (see documentation), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.

Log categorization: Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped, based on their messages and formats, so that you can take action quicker.

High-latency or erroneous transactions: Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes. An overview of this capability is published here: APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions.

AIOps Labs: AIOps Labs provides two main capabilities using advanced statistical methods:

  • Log spike detector helps identify reasons for increases in log rates. It makes it easy to find and investigate causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.
  • Log pattern analysis helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.

In this blog, we will cover anomaly detection and log categorization against the popular "Hipster Shop app" developed by Google, and modified recently by OpenTelemetry.

Overviews of high-latency capabilities can be found here, and an overview of AIOps labs can be found here.

In this blog, we will examine a scenario where we use anomaly detection and log categorization to help identify a root cause of an issue in Hipster Shop.

Prerequisites and config

If you plan on following this blog, here are some of the components and details we used to set up this demonstration:

Once you’ve instrumented your application with APM (Elastic or OTel) agents and are ingesting metrics and logs into Elastic Observability, you should see a service map for the application as follows:

In our example, we’ve introduced issues to help walk you through the root cause analysis features: anomaly detection and log categorization. You might have a different set of anomalies and log categorization depending on how you load the application and/or introduce specific issues.

As part of the walk-through, we’ll assume we are a DevOps or SRE managing this application in production.

Root cause analysis

While the application has been running normally for some time, you get a notification that some of the services are unhealthy. This can occur from the notification setting you’ve set up in Elastic or other external notification platforms (including customer related issues). In this instance, we’re assuming that customer support has called in multiple customer complaints about the website.

How do you as a DevOps or SRE investigate this? We will walk through two avenues in Elastic to investigate the issue:

  • Anomaly detection
  • Log categorization

While we show these two paths separately, they can be used in conjunction and are complementary, as they are both tools Elastic Observability provides to help you troubleshoot and identify a root cause.

Machine learning for anomaly detection

Elastic will detect anomalies based on historical patterns and identify a probability of these issues.

Starting with the service map, you can see anomalies identified with red circles and as we select them, Elastic will provide a score for the anomaly.

In this example, we can see that there is a score of 96 for a specific anomaly for the productCatalogService in the Hipster Shop application. An anomaly score indicates the significance of the anomaly compared to previously seen anomalies. More information on anomaly detection results can be found here. We can also dive deeper into the anomaly and analyze the details.

What you will see for the productCatalogService is that there is a severe spike in average transaction latency time, which is the anomaly that was detected in the service map. Elastic’s machine learning has identified a specific metric anomaly (shown in the single metric view). It’s likely that customers are potentially responding to the slowness of the site and that the company is losing potential transactions.

One step to take next is to review all the other potential anomalies that we saw in the service map in a larger picture. Use an anomaly explorer to view all the anomalies that have been identified.

Elastic is identifying numerous services with anomalies. productCatalogService has the highest score, and a good number of others (frontend, checkoutService, advertService, and more) also have high scores. However, this analysis is looking at just one metric.

Elastic can help detect anomalies across all types of data, such as kubernetes data, metrics, and traces. If we analyze across all these types (via individual jobs we’ve created in Elastic machine learning), we will see a more comprehensive view as to what is potentially causing this latency issue.

Once all the potential jobs are selected and we’ve sorted by service.name, we can see that productCatalogService is still showing a high anomaly influencer score.

In addition to the chart giving us a visual of the anomalies, we can review all the potential anomalies. As you will notice, Elastic has also categorized these anomalies (see the category examples column). As we scroll through the results, we notice a potential postgreSQL issue from the categorization, which also has a high score of 94. Machine learning has identified a “rare mlcategory,” meaning that it has rarely occurred, hence pointing to a potential cause of the issue customers are seeing.

We also notice that this issue is potentially caused by pgbench, a popular postgreSQL tool to help benchmark the database. pgbench runs the same sequence of SQL commands over and over, possibly in multiple concurrent database sessions. While pgbench is definitely a useful tool, it should not be used in a production environment, as it causes heavy load on the database host, likely causing the higher latency issues on the site.

While this may or may not be the ultimate root cause, we have rather quickly identified a potential issue that has a high probability of being the root cause. An engineer likely intended to run pgbench against a staging database to evaluate its performance, not the production environment.

Machine learning for log categorization

Elastic Observability’s service map has detected an anomaly, and in this part of the walk-through, we take a different approach by investigating the service details from the service map versus initially exploring the anomaly. When we explore the service details for productCatalogService, we see the following:

The service details are identifying several things:

  1. There is abnormally high latency compared to the expected bounds of the service. We see that latency was recently higher than normal (upwards of 1s) compared to the average of 275ms.
  2. There is also a high failure rate for the same time frame as the high latency (lower left chart, “Failed transaction rate”).
  3. Additionally, we can see the transactions, and one in particular, /ListProduct, has abnormally high latency in addition to a high failure rate.
  4. We see productCatalogService has a dependency on postgreSQL.
  5. We also see errors all related to postgreSQL.

We have an option to dig through the logs and analyze in Elastic or we can use a capability to identify the logs more easily.

If we go to Categories under Logs in Elastic Observability and search for postgresql.log to help identify postgresql logs that could be causing this error, we see that Elastic’s machine learning has automatically categorized the postgresql logs.

We notice two additional items:

  • There is a high-count category (message count of 23,797 with a high anomaly score of 70) related to pgbench (which is odd to see in production). Hence we search further for all pgbench-related logs in Categories.
  • We see an odd issue regarding terminating the connection (with a low count).

While investigating the second error, which is severe, we can see logs from Categories before and after the error.

This troubleshooting shows postgreSQL having a FATAL error, the database shutting down prior to the error, and all connections terminating. Given the two immediate issues we identified, we have an idea that someone was running pgbench and this potentially overloaded the database, causing the latency issue that customers are seeing.

The next steps here could be to investigate anomaly detection and/or work with the developers to review the code and identify pgbench as part of the deployed configuration.

Conclusion

I hope you’ve gotten an appreciation for how Elastic Observability can help you further identify and get closer to pinpointing root cause of issues without having to look for a “needle in a haystack.” Here’s a quick recap of lessons and what you learned:

  • Elastic Observability has numerous capabilities to help you reduce your time to find root cause and improve your MTTR (even MTTD). In particular, we reviewed the following two main capabilities in this blog:

    1. Anomaly detection: Elastic Observability, when turned on (see documentation), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.
    2. Log categorization: Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped based on their messages and formats so that you can take action quicker.
  • You learned how easy and simple it is to use Elastic Observability’s log categorization and anomaly detection capabilities without having to understand machine learning (which helps drive these features) or having to do any lengthy setup. Ready to get started? Register for Elastic Cloud and try out the features and capabilities I’ve outlined above.

Additional logging resources:

Common use case examples with logs:

]]>
<![CDATA[Build better Service Level Objectives (SLOs) from logs and metrics]]> https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics service-level-objectives-slos-logs-metrics Fri, 23 Feb 2024 00:00:00 GMT In today's digital landscape, applications are at the heart of both our personal and professional lives. We've grown accustomed to these applications being perpetually available and responsive. This expectation places a significant burden on the shoulders of developers and operations teams.

Site reliability engineers (SREs) face the challenging task of sifting through vast quantities of data, not just from the applications themselves but also from the underlying infrastructure. In addition to data analysis, they are responsible for ensuring the effective use and development of operational tools. The growing volume of data, the daily resolution of issues, and the continuous evolution of tools and processes can detract from the focus on business performance.

Elastic Observability offers a solution to this challenge. It enables SREs to integrate and examine all telemetry data (logs, metrics, traces, and profiling) in conjunction with business metrics. This comprehensive approach to data analysis fosters operational excellence, boosts productivity, and yields critical insights, all of which are integral to maintaining high-performing applications in a demanding digital environment.

To help manage operations and business metrics, Elastic Observability's SLO (Service Level Objectives) feature was introduced in 8.12. This feature enables setting measurable performance targets for services, such as availability, latency, traffic, errors, and saturation, or you can define your own. Key components include:

  • Defining and monitoring SLIs (Service Level Indicators)

  • Monitoring error budgets indicating permissible performance shortfalls

  • Alerting on burn rates showing error budget consumption

Users can monitor SLOs in real-time with dashboards, track historical performance, and receive alerts for potential issues. Additionally, SLO dashboard panels offer customized visualizations.

Service Level Objectives (SLOs) are generally available for our Platinum and Enterprise subscription customers.

<Video vidyardUuid="ngfY9mrkNEkjmpRY4Qd5Pb" />

In this blog, we will outline the following:

  • What are SLOs? A Google SRE perspective

  • Several scenarios of defining and managing SLOs

Service Level Objective overview

Service Level Objectives (SLOs) are a crucial component for Site Reliability Engineering (SRE), as detailed in Google's SRE Handbook. They provide a framework for quantifying and managing the reliability of a service. The key elements of SLOs include:

  • Service Level Indicators (SLIs): These are carefully selected metrics, such as uptime, latency, throughput, error rates, or other important metrics, that represent the aspects of the service and are important from an operations or business perspective. Hence, an SLI is a measure of the service level provided (latency, uptime, etc.), and it is defined as a ratio of good over total events, with a range between 0% and 100%.

  • Service Level Objective (SLO): An SLO is the target value for a service level measured as a percentage by an SLI. Above the threshold, the service is compliant. As an example, if we want to use service availability as an SLI, with the number of successful responses at 99.9%, then any time the number of failed responses is > .1%, the SLO will be out of compliance.

  • Error budget: This represents the threshold of acceptable errors, balancing the need for reliability with practical limits. It is defined as 100% minus the SLO, i.e., the quantity of errors that can be tolerated.

  • Burn rate: This concept relates to how quickly the service is consuming its error budget, which is the acceptable threshold for unreliability agreed upon by the service providers and its users.

Understanding these concepts and effectively implementing them is essential for maintaining a balance between innovation and reliability in service delivery. For more detailed information, you can refer to Google's SRE Handbook.
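
As a concrete illustration of how these quantities relate, here is a simple sketch of the arithmetic (illustrative only, not Elastic's internal implementation):

# Worked example of the SLI / error budget / burn rate arithmetic described
# above; the event counts are illustrative only.
slo_target = 0.999                 # 99.9% objective over the SLO window
total_events = 2_000_000
bad_events = 17_000

sli = (total_events - bad_events) / total_events   # good events / total events
error_budget = 1 - slo_target                      # 0.1% of events may fail
error_rate = bad_events / total_events
burn_rate = error_rate / error_budget              # >1 means burning budget too fast

print(f"SLI: {sli:.3%}")                 # 99.150%
print(f"Burn rate: {burn_rate:.1f}x")    # 8.5x the budgeted error rate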

One main thing to remember is that SLO monitoring is not incident monitoring. SLO monitoring is a proactive, strategic approach designed to ensure that services meet established performance standards and user expectations. It involves tracking Service Level Objectives, error budgets, and the overall reliability of a service over time. This predictive method helps in preventing issues that could impact users and aligns service performance with business objectives.

In contrast, incident monitoring is a reactive process focused on detecting, responding to, and mitigating service incidents as they occur. It aims to address unexpected disruptions or failures in real time, minimizing downtime and impact on service. This includes monitoring system health, errors, and response times during incidents, with a focus on rapid response to minimize disruption and preserve the service's reputation.

Elastic®’s SLO capability is based directly off the Google SRE Handbook. All the definitions and semantics are utilized as described in Google’s SRE handbook. Hence users can perform the following on SLOs in Elastic:

  • Define an SLO on an SLI such as KQL (log based query), service availability, service latency, custom metric, histogram metric, or a timeslice metric. Additionally, set the appropriate threshold.

  • Utilize occurrence versus time slice based budgeting. Occurrences uses the number of good events over the number of total events to compute the SLO. Timeslices break the overall time window into smaller slices of a defined duration and compute the number of good slices over the total slices to compute the SLO. Timeslice targets are more accurate and useful when calculating things like a service’s SLO when trying to meet agreed upon customer targets.

  • Manage all the SLOs in a singular location.

  • Trigger alerts from the defined SLO, whether the SLI is off, burn rate is used up, or the error rate is X.

  • Create unique service level dashboards with SLO information for a more comprehensive view of the service.

Create alerts

Create dashboards

SREs need to be able to manage business metrics.

SLOs based on logs: NGINX availability

Defining SLOs does not always mean metrics need to be used. Logs are a rich form of information, even when they have metrics embedded in them. Hence it’s useful to understand your business and operations status based on logs.

Elastic allows you to create an SLO based on specific fields in the log message, which don’t have to be metrics. A simple example is a simple multi-tier app that has a web server layer (nginx), a processing layer, and a database layer.

Let’s say that your processing layer is managing a significant number of requests and you want to ensure that the service is up properly. The best way is to ensure that all http.response.status_code values are less than 500. Anything below 500 means the service is up, and any errors (like 404) are user or client errors rather than server errors.

expanded document

If we use Discover in Elastic, we see that there are close to 2M log messages over a seven-day time frame.

17k

Additionally, the number of messages with http.response.status_code > 500 is minimal, like 17K.

Rather than creating an alert, we can create an SLO with this query:

edit SLO

We chose to use occurrences as the budgeting method to keep things simple.

Once defined, we can see how well our SLO is performing over a seven-day time frame. We can see not only the SLO, but also the burn rate, the historical SLI, and error budget, and any specific alerts against the SLO.

SLOs

nginx server availability

Not only do we get information about the violation, but we also get:

  • Historical SLI (7 days)

  • Error budget burn down

  • Good vs. bad events (24 hours)

Percentages

We can see how we’ve easily burned through our error budget.

Hence something must be going on with nginx. To investigate, all we need to do is utilize the AI Assistant, and use its natural language interface to ask questions to help analyze the situation.

Let’s use Elastic’s AI Assistant to analyze the breakdown of http.response.status_code across all the logs from the past seven days. This helps us understand how many 50X errors we are getting.

count of http response status code

As we can see, the number of 502s is minimal compared to the number of overall messages, but it is affecting our SLO.

However, it seems like Nginx is having an issue. In order to reduce the issue, we also ask the AI Assistant how to work on this error. Specifically, we ask if there is an internal runbook the SRE team has created.

ai assistant thread

AI Assistant gets a runbook the team has added to its knowledge base. I can now analyze and try to resolve or reduce the issue with nginx.

While this is a simple example, there are an endless number of possibilities that can be defined based on KQL. Some other simple examples:

  • 99% of requests occur under 200ms

  • 99% of log messages are not errors

Application SLOs: OpenTelemetry demo cartservice

A common application developers and SREs use to learn about OpenTelemetry and test out Observability features is the OpenTelemetry demo.

This demo has feature flags to simulate issues. With Elastic’s alerting and SLO capability, you can also determine how well the entire application is performing and how well your customer experience is holding up when these feature flags are used.

Elastic supports OpenTelemetry by taking OTLP directly with no need for an Elastic specific agent. You can send in OpenTelemetry data directly from the application (through OTel libraries) and through the collector.

We’ve brought up the OpenTelemetry demo on a K8S cluster (AWS EKS) and turned on the cartservice feature flag. This inserts errors into the cartservice. We’ve also created two SLOs to monitor the cartservice’s availability and latency.

SLOs

We can see that the cartservice’s availability SLO is violated. As we drill down, we see that there aren’t as many successful transactions, which is affecting the SLO.

cartservice-otel

As we drill into the service, we can see in Elastic APM that there is a higher than normal failure rate of about 5.5% for the emptyCart service.

apm

We can investigate this further in APM, but that is a discussion for another blog. Stay tuned to see how we can use Elastic’s machine learning, AIOps, and AI Assistant to understand the issue.

Conclusion

SLOs allow you to set clear, measurable targets for your service performance, based on factors like availability, response times, error rates, and other key metrics. Hopefully with the overview we’ve provided in this blog, you can see that:

  • SLOs can be based on logs. In Elastic, you can use KQL to essentially find and filter on specific logs and log fields to monitor and trigger SLOs.

  • AI Assistant is a valuable, easy-to-use capability to analyze, troubleshoot, and even potentially resolve SLO issues.

  • APM Service based SLOs are easy to create and manage with integration to Elastic APM. We also use OTel telemetry to help monitor SLOs.

For more information on SLOs in Elastic, check out Elastic documentation and the following resources:

Ready to get started? Sign up for Elastic Cloud and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your SLOs.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.

Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.

]]>
<![CDATA[Simplifying log data management: Harness the power of flexible routing with Elastic]]> https://www.elastic.co/observability-labs/blog/simplifying-log-data-management-flexible-routing simplifying-log-data-management-flexible-routing Tue, 13 Jun 2023 00:00:00 GMT In Elasticsearch 8.8, we’re introducing the reroute processor in technical preview that makes it possible to send documents, such as logs, to different data streams, according to flexible routing rules. When using Elastic Observability, this gives you more granular control over your data with regard to retention, permissions, and processing with all the potential benefits of the data stream naming scheme. While optimized for data streams, the reroute processor also works with classic indices. This blog post contains examples on how to use the reroute processor that you can try on your own by executing the snippets in the Kibana dev tools.

Elastic Observability offers a wide range of integrations that help you to monitor your applications and infrastructure. These integrations are added as policies to Elastic agents, which help ingest telemetry into Elastic Observability. Several examples of these integrations include the ability to ingest logs from systems that send a stream of logs from different applications, such as Amazon Kinesis Data Firehose, Kubernetes container logs, and syslog. One challenge is that these multiplexed log streams are sending data to the same Elasticsearch data stream, such as logs-syslog-default. This makes it difficult to create parsing rules in ingest pipelines and dashboards for specific technologies, such as the ones from the Nginx and Apache integrations. That’s because in Elasticsearch, in combination with the data stream naming scheme, the processing and the schema are both encapsulated in a data stream.

The reroute processor helps you tease apart data from a generic data stream and send it to a more specific one. You may use that mechanism to send logs to a data stream that is set up by the Nginx integration, for example, so that the logs are parsed with that integration and you can use the integration’s prebuilt dashboards or create custom ones with the fields, such as the url, the status code, and the response time that the Nginx pipeline has parsed out of the Nginx log message. You can also split out/separate regular Nginx logs and errors with the reroute processor, providing further separation ability and categorization of logs.

routing pipeline

Example use case

To use the reroute processor, first:

  1. Ensure you are on Elasticsearch 8.8

  2. Ensure you have permissions to manage indices and data streams

  3. If you don’t already have an account on Elastic Cloud, sign up for one

Next, you’ll need to set up a data stream and create a custom Elasticsearch ingest pipeline that is called as the default pipeline. Below we go through this step by step for the “mydata” data set that we’ll simulate ingesting container logs into. We start with a basic example and extend it from there.

The following steps should be utilized in the Elastic console, which is found at Management -> Dev tools -> Console. First, we need an ingest pipeline and a template for the data stream:

PUT _ingest/pipeline/logs-mydata
{
  "description": "Routing for mydata",
  "processors": [
    {
      "reroute": {
      }
    }
  ]
}

This creates an ingest pipeline with an empty reroute processor. To make use of it, we need an index template:

PUT _index_template/logs-mydata
{
  "index_patterns": [
    "logs-mydata-*"
  ],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": {
      "index.default_pipeline": "logs-mydata"
    },
    "mappings": {
      "properties": {
        "container.name": {
          "type": "keyword"
        }
      }
    }
  }
}

The above template is applied to all data that is shipped to logs-mydata-*. We have mapped container.name as a keyword, as this is the field we will be using for routing later on. Now, we send a document to the data stream and it will be ingested into logs-mydata-default:

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-05-25T12:26:23+00:00",
  "container": {
    "name": "foo"
  }
}

We can check that it was ingested with the command below, which will show 1 result.

GET logs-mydata-default/_search

Without modifying the routing processor, this already allows us to route documents. As soon as the reroute processor is specified, it will look for data_stream.dataset and data_stream.namespace fields by default and will send documents to the corresponding data stream, according to the data stream naming scheme logs-<dataset>-<namespace>. Let’s try this out:

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-03-30T12:27:23+00:00",
  "container": {
"name": "foo"
  },
  "data_stream": {
    "dataset": "myotherdata"
  }
}

As can be seen with the GET logs-mydata-default/_search command, this document ended up in the logs-myotherdata-default data stream. But instead of using default rules, we want to create our own rules for the field container.name. If the field is container.name = foo, we want to send it to logs-foo-default. For this we modify our routing pipeline:

PUT _ingest/pipeline/logs-mydata
{
  "description": "Routing for mydata",
  "processors": [
    {
      "reroute": {
        "tag": "foo",
        "if" : "ctx.container?.name == 'foo'",
        "dataset": "foo"
      }
    }
  ]
}

Let's test this with a document:

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-05-25T12:26:23+00:00",
  "container": {
    "name": "foo"
  }
}

While it would be possible to specify a routing rule for each container name, you can also route by the value of a field in the document:

PUT _ingest/pipeline/logs-mydata
{
  "description": "Routing for mydata",
  "processors": [
    {
      "reroute": {
        "tag": "mydata",
        "dataset": [
          "{{container.name}}",
          "mydata"
        ]
      }
    }
  ]
}

In this example, we are using a field reference as a routing rule. If the container.name field exists in the document, its value is used as the dataset; otherwise, it falls back to mydata. This can be tested with:

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-05-25T12:26:23+00:00",
  "container": {
    "name": "foo1"
  }
}

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-05-25T12:26:23+00:00",
  "container": {
    "name": "foo2"
  }
}

This creates the data streams logs-foo1-default and logs-foo2-default.
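
To double-check what the reroute processor created, you can list the resulting data streams. Below is a small sketch with the Python client, assuming the same cluster used for the Dev Tools examples above.

# Sketch: list the data streams created by the routing rules above, assuming
# the same cluster used for the Dev Tools examples.
from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id="<your cloud id>", basic_auth=("elastic", "<password>"))

for ds in es.indices.get_data_stream(name="logs-foo*")["data_streams"]:
    print(ds["name"])   # expect logs-foo1-default and logs-foo2-default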

NOTE: There is currently a limitation in the processor that requires the fields specified in a {{field.reference}} to be in a nested object notation. A dotted field name does not currently work. Also, you’ll get errors when the document contains dotted field names for any data_stream.* field. This limitation will be fixed in 8.8.2 and 8.9.0.

API keys

When using the reroute processor, it is important that the API keys specified have permissions for the source and target indices. For example, if a pattern is used for routing from logs-mydata-default, the API key must have write permissions for logs-*-* as data could end up in any of these indices (see example further down).

We’re currently working on extending the API key permissions for our integrations so that they allow for routing by default if you’re running a Fleet-managed Elastic Agent.

If you’re using a standalone Elastic Agent, or any other shipper, you can use this as a template to create your API key:

POST /_security/api_key
{
  "name": "ingest_logs",
  "role_descriptors": {
    "ingest_logs": {
      "cluster": [
        "monitor"
      ],
      "indices": [
        {
          "names": [
            "logs-*-*"
          ],
          "privileges": [
            "auto_configure",
            "create_doc"
          ]
        }
      ]
    }
  }
}

Future plans

In Elasticsearch 8.8, the reroute processor was released in technical preview. The plan is to adopt this in our data sink integrations like syslog, k8s, and others. Elastic will provide default routing rules that just work out of the box, but it will also be possible for users to add their own rules. If you are using our integrations, follow this guide on how to add a custom ingest pipeline.

Try it out!

This blog post has shown some sample use cases for document based routing. Try it out on your data by adjusting the commands for index templates and ingest pipelines to your own data, and get started with Elastic Cloud through a 7-day free trial. Let us know via this feedback form how you’re planning to use the reroute processor and whether you have suggestions for improvement.

]]>
<![CDATA[Smarter log analytics in Elastic Observability]]> https://www.elastic.co/observability-labs/blog/smarter-log-analytics-in-elastic-observability smarter-log-analytics-in-elastic-observability Mon, 10 Jun 2024 00:00:00 GMT Discover a smarter way to handle your logs with Kibana's latest features! Our new Data Source selector makes it effortless to zero in on the logs you need, whether they're from System Logs or Application Logs by selecting your integrations or data views. Plus, with the introduction of Smart Fields, your log analysis is now more intuitive and insightful. Get ready to simplify your workflow and uncover deeper insights with these game-changing updates. Dive in and see how easy log exploration can be!

Smart fields

Find the logs you’re looking for

Focus on logs from specific integrations or data views

We've added the Data Source selector, a handy new feature for viewing specific logs. Now, you can easily filter your logs based on your integrations, like System Logs, Nginx, or Elastic APM, or switch between different data views, like logs or metrics. This new selector is all about making your data easier to find and helping you focus on what matters most in your analysis.

Dive into your logs

Analyze logs with Smart Fields in Kibana

Logs in Kibana have undergone a significant transformation, particularly in the way log data is presented. The once-basic table view has evolved with the introduction of Smart Fields, providing users with a more insightful and dynamic log analysis experience.

Resource Smart Field - centralizing log source information

The resource column further elevates the Logs Explorer page by providing users with a single column for exploring the resource that created the log event. This column groups various resource-indicating fields together, streamlining the investigation process. Currently, the following ECS fields are grouped under this single column and we recommend including them in your logs:

We know this does not include all use cases and would like your feedback on other fields you use/are important for you to help us provide a tailored and user-centric log analysis experience.

Content Smart Field - a deeper dive into log data

The content column revolutionizes log analysis by seamlessly rendering log.level and message fields. Notably, it automatically handles fallbacks, ensuring a smooth transition when the actual message field is not available. This enhancement simplifies the log exploration process, offering users a more comprehensive understanding of their data.

Actions column - unleashing additional columns

As part of our commitment to empowering users, we are introducing the actions column, adding a layer of functionality to the document table. This column includes two powerful actions:

  • Degraded document indicator: This indicator provides insights about the quality of your data by indicating that fields were ignored when the document was indexed and ended up in the _ignored property of the document. To help analyze what caused the document to degrade, we suggest reading this blog - The antidote for index mapping exceptions: ignore_malformed.
  • Stacktrace indicator: This indicator informs users of the presence of stack traces in the document. This makes it easy to navigate through logs documents and know if they have additional information.

Investigate individual logs by expanding log details

Now, when you click the expand icon in the actions column, it opens up the Log details flyout for any log entry. This new feature gives you a detailed overview of the entry right at your fingertips. Inside the flyout, the Overview tab is neatly organized into four sections—Content breakdown, Service & Infrastructure, Cloud, and Others—each offering a snapshot of the most crucial information. Plus, you'll find the same handy controls you're used to in the main table, like filtering in or out, adding or removing columns, and copying data, making it easier than ever to manage your logs directly from the flyout.

The Observability AI Assistant is fully integrated into this view providing contextual insights about the log event and helping to find similar messages.

Experience a streamlined approach to log exploration

These enhancements simplify the process of finding and focusing on specific logs and offer more intuitive and insightful data presentation. Dive into your logs with these tools and streamline your workflow, uncovering deeper insights with ease. Try it now and transform your log analysis!

]]>
<![CDATA[Easily analyze AWS VPC Flow Logs with Elastic Observability]]> https://www.elastic.co/observability-labs/blog/vpc-flow-logs-monitoring-analytics-observability vpc-flow-logs-monitoring-analytics-observability Mon, 23 Jan 2023 00:00:00 GMT Elastic Observability provides a full-stack observability solution, by supporting metrics, traces, and logs for applications and infrastructure. In a previous blog, I showed you an AWS monitoring infrastructure running a three-tier application. Specifically we reviewed metrics ingest and analysis on Elastic Observability for EC2, VPC, ELB, and RDS. In this blog, we will cover how to ingest logs from AWS, and more specifically, we will review how to get VPC Flow Logs into Elastic and what you can do with this data.

Logging is an important part of observability, for which we generally think of metrics and/or tracing. However, the amount of logs an application or the underlying infrastructure output can be significantly daunting.

With Elastic Observability, there are three main mechanisms to ingest logs:

  • The new Elastic Agent pulls metrics and logs from CloudWatch and S3, where logs are generally pushed from a service (for example, EC2, ELB, WAF, Route53, etc.). We reviewed the Elastic Agent metrics configuration for EC2, RDS (Aurora), ELB, and NAT in this blog.
  • Using Elastic’s Serverless Forwarder (which runs on Lambda and is available in the AWS Serverless Application Repository, or SAR) to send logs from S3, Kinesis, CloudWatch, and other AWS services into Elastic.
  • Beta feature (contact your Elastic account team): Using Amazon Kinesis Data Firehose to send logs from AWS directly into Elastic, specifically if you are running the Elastic Stack on AWS infrastructure.

In this blog, we will provide an overview of the second option: Elastic’s serverless forwarder collecting VPC Flow Logs from an application deployed on EC2 instances. Here’s what we'll cover:

  • A walk-through on how to analyze VPC Flow Log info with Elastic’s Discover, dashboard, and ML analysis.
  • A detailed step-by-step overview and setup of the Elastic serverless forwarder on AWS as a pipeline for VPC Flow Logs into Elastic Cloud.

Elastic’s serverless forwarder on AWS Lambda

AWS users can quickly ingest logs stored in Amazon S3, CloudWatch, or Kinesis with the Elastic serverless forwarder, an AWS Lambda application, and view them in the Elastic Stack alongside other logs and metrics for centralized analytics. Once the serverless forwarder is configured and deployed from the AWS Serverless Application Repository (SAR), logs will be ingested and available in Elastic for analysis. See the following links for further configuration guidance:

In our configuration we will ingest VPC Flow Logs into Elastic for the three-tier app deployed in the previous blog.

There are three different input configurations for the Elastic serverless forwarder; logs can be ingested directly from:

  • Amazon CloudWatch: Elastic serverless forwarder can pull VPC Flow Logs directly from an Amazon CloudWatch log group, which is a commonly used endpoint to store VPC Flow Logs in AWS.
  • Amazon Kinesis: Elastic serverless forwarder can pull VPC Flow Logs directly from Kinesis, which is another location to publish VPC Flow Logs.
  • Amazon S3: Elastic serverless forwarder can pull VPC Flow Logs from Amazon S3 via SQS event notifications, which is a common endpoint to publish VPC Flow Logs in AWS.

In the second half of this blog, we will review a common configuration: sending VPC Flow Logs to Amazon S3 and from there into Elastic Cloud.

But first let's review how to analyze VPC Flow Logs on Elastic.

Analyzing VPC Flow Logs in Elastic

Now that you have VPC Flow Logs in Elastic Cloud, how can you analyze them?

There are several analyses you can perform on the VPC Flow Log data:

  1. Use Elastic’s Analytics Discover capabilities to manually analyze the data.
  2. Use Elastic Observability’s anomaly feature to identify anomalies in the logs.
  3. Use an out-of-the-box (OOTB) dashboard to further analyze data.

Using Elastic Discover

In Elastic analytics, you can search and filter your data, get information about the structure of the fields, and display your findings in a visualization. You can also customize and save your searches and place them on a dashboard. With Discover, you can:

  • View logs in bulk, within specific time frames
  • Look at individual details of each entry (document)
  • Filter for specific values
  • Analyze fields
  • Create and save searches
  • Build visualizations

For a complete understanding of Discover and all of Elastic’s analytics capabilities, see the Elastic documentation.

For VPC Flow Logs, the important things to understand are:

  • How many logs were accepted/rejected
  • Where potential security violations occur (for example, source IPs from outside the VPC)
  • What port is generally being queried

I’ve filtered the logs on the following:

  • Amazon S3: bshettisartest
  • VPC Flow Log action: REJECT
  • VPC Network Interface: Webserver 1

We want to see what IP addresses are trying to hit our web servers.

From that, we want to understand which IP addresses we are getting the most REJECTs from, so we simply look at the source.ip field. We can then quickly get a breakdown that shows 185.242.53.156 has been rejected the most over the 3+ hours since we turned on VPC Flow Logs.

Additionally, I can see a visualization by selecting the “Visualize” button. We get the following, which we can add to a dashboard:

In addition to IP addresses, we also want to see which ports are being hit on our web servers.
We select the destination port field, and the quick pop-up shows us a list of ports being targeted. We can see that port 23 is being targeted (generally used for telnet), port 445 is being targeted (SMB, used by Microsoft Active Directory services), and port 443 (used for HTTPS/SSL). We also see that these are all REJECTs.
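
If you prefer to run the same analysis outside of Discover, the sketch below asks Elasticsearch for the top rejected source IPs and destination ports. It assumes the Elasticsearch Python client, a logs-aws.vpcflow-* index pattern, and field names from ECS and the AWS VPC Flow integration (aws.vpcflow.action, source.ip, destination.port); adjust these if your pipeline maps the flow log fields differently.

from elasticsearch import Elasticsearch

# Placeholders -- use your own endpoint and API key.
es = Elasticsearch("https://localhost:9200", api_key="<your-api-key>")

# Aggregate REJECTed flows by source IP and destination port.
resp = es.search(
    index="logs-aws.vpcflow-*",
    size=0,
    query={"term": {"aws.vpcflow.action": "REJECT"}},
    aggs={
        "top_rejected_sources": {"terms": {"field": "source.ip", "size": 10}},
        "top_rejected_ports": {"terms": {"field": "destination.port", "size": 10}},
    },
)

for bucket in resp["aggregations"]["top_rejected_sources"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])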

Anomaly detection in Elastic Observability logs

In addition to Discover, Elastic Observability provides the ability to detect anomalies in logs. In Elastic Observability -> Logs -> Anomalies, you can turn on machine learning for:

  • Log rate: automatically detects anomalous log entry rates
  • Categorization: automatically categorizes log messages

For our VPC Flow Log, we turned both on. And when we look at what has been detected for anomalous log entry rates, we see:

Elastic immediately detected a spike in logs when we turned on VPC Flow Logs for our application. The rate change is detected because we had also been ingesting VPC Flow Logs from another application for a couple of days before adding the application in this blog.

We can drill down into this anomaly with machine learning and analyze it further.

There is more machine learning analysis you can utilize with your logs — check out Elastic machine learning documentation.

Since we know that a spike exists, we can also use the Explain Log Rate Spikes capability in Elastic’s AIOps Labs under Machine Learning. Additionally, we’ve grouped the results to see what is causing some of the spikes.

As we can see, a specific network interface is sending more VPC flow logs than others. We can drill down into this further in Discover.

VPC Flow Log dashboard on Elastic Observability

Finally, Elastic also provides an OOTB dashboard showing the top IP addresses hitting your VPC, where they are coming from geographically, the time series of the flows, and a summary of VPC Flow Log rejects within the time frame.

This is a baseline dashboard that can be enhanced with visualizations you find in Discover, as we reviewed in option 1 (Using Elastic’s Analytics Discover capabilities) above.

Setting it all up

Let’s walk through the details of configuring the Elastic serverless forwarder and Elastic Observability to ingest data.

Prerequisites and config

If you plan on following these steps, here are some of the components and details we used to set up this demonstration:

Step 0: Get an account on Elastic Cloud

Follow the instructions to get started on Elastic Cloud.

Step 1: Deploy Elastic on AWS

Once logged in to Elastic Cloud, create a deployment on AWS. It’s important to ensure that the deployment is on AWS: the Elastic serverless forwarder runs as a Lambda function in your AWS account and sends data to your Elasticsearch endpoint, so keeping that endpoint on AWS keeps the data path within AWS.

Once your deployment is created, make sure you copy the Elasticsearch endpoint.

The endpoint should be an AWS endpoint, such as:

https://aws-logs.es.us-east-1.aws.found.io
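
If you want to confirm the endpoint is reachable before continuing, a minimal check with the Elasticsearch Python client looks like the sketch below; the API key is a placeholder for one you create in your deployment.

from elasticsearch import Elasticsearch

# Endpoint and API key are placeholders -- use the values from your deployment.
es = Elasticsearch(
    "https://aws-logs.es.us-east-1.aws.found.io",
    api_key="<your-api-key>",
)

# info() returns the cluster name and version if the endpoint is reachable.
print(es.info())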

Step 2: Turn on Elastic’s AWS Integrations on AWS

In your deployment’s Elastic Integration section, go to the AWS integration and select Install AWS assets.

Step 3: Deploy your application

Follow the instructions listed in AWS’s Three-Tier app and the instructions in the workshop on GitHub. The workshop is listed here.

Once you’ve installed the app, get credentials from AWS. These will be needed for Elastic’s AWS integration.

There are several options for credentials:

  • Use access keys directly
  • Use temporary security credentials
  • Use a shared credentials file
  • Use an IAM role Amazon Resource Name (ARN)

View more details on specifics around necessary credentials and permissions.

Step 4: Send VPC Flow Logs to Amazon S3 and set up Amazon SQS

In the VPC for the application deployed in Step 3, you will need to configure VPC Flow Logs and point them to an Amazon S3 bucket. Specifically, you will want to keep the log format as the AWS default.

Create the VPC Flow log.
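
If you prefer to script this step, the sketch below creates the flow log with boto3 and leaves the log format at the AWS default; the VPC ID and region are placeholders, while the bucket name reuses the one from this walkthrough.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Publish all traffic (ACCEPT and REJECT) for the VPC to the S3 bucket.
# Omitting LogFormat keeps the AWS default format, as recommended above.
resp = ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],   # placeholder VPC ID
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::bshettisartest",
)
print(resp["FlowLogIds"])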

Next, create an Amazon SQS queue and configure the S3 bucket to send event notifications for newly created objects to that queue; the Elastic serverless forwarder will use these notifications as its trigger.
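
A sketch of that wiring with boto3 is shown below; the queue name and region are placeholders, and note that the queue’s access policy must also allow the S3 service to send messages to it (omitted here for brevity).

import boto3

region = "us-east-1"
sqs = boto3.client("sqs", region_name=region)
s3 = boto3.client("s3", region_name=region)

# Queue that the Elastic serverless forwarder will read S3 event notifications from.
queue_url = sqs.create_queue(QueueName="vpc-flow-logs-queue")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Notify the queue whenever a new flow log object is written to the bucket.
# The queue policy must grant s3.amazonaws.com permission to SendMessage.
s3.put_bucket_notification_configuration(
    Bucket="bshettisartest",
    NotificationConfiguration={
        "QueueConfigurations": [
            {"QueueArn": queue_arn, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)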

Step 5: Set up Elastic Serverless Forwarder on AWS

Follow the instructions listed in Elastic’s documentation and refer to the previous blog providing an overview. The important bits when configuring the Lambda application from the AWS Serverless Application Repository are to ensure you:

  • Specify the S3 Bucket in ElasticServerlessForwarderS3Buckets where the VPC Flow Logs are being sent. The value is the ARN of the S3 Bucket you created in Step 4.
  • Specify the configuration file path in ElasticServerlessForwarderS3ConfigFile. The value is the S3 URL in the format "s3://bucket-name/config-file-name" pointing to the configuration file (sarconfig.yaml); a sketch of this file is shown after this list.
  • Specify the S3 SQS Notifications queue used as the trigger of the Lambda function in ElasticServerlessForwarderS3SQSEvents. The value is the ARN of the SQS Queue you set up in Step 4.
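
To make the ElasticServerlessForwarderS3ConfigFile value concrete, the sketch below uploads a minimal sarconfig.yaml to the bucket. The YAML shape follows Elastic’s serverless forwarder documentation, but the SQS ARN, endpoint, API key, and data stream name are placeholders you should verify against the docs for your setup.

import boto3

# Minimal forwarder configuration; verify the field names against Elastic's
# serverless forwarder documentation before deploying.
CONFIG = """\
inputs:
  - type: s3-sqs
    id: arn:aws:sqs:us-east-1:123456789012:vpc-flow-logs-queue   # placeholder ARN
    outputs:
      - type: elasticsearch
        args:
          elasticsearch_url: https://aws-logs.es.us-east-1.aws.found.io
          api_key: <your-api-key>
          es_datastream_name: logs-aws.vpcflow-default
"""

s3 = boto3.client("s3", region_name="us-east-1")
s3.put_object(Bucket="bshettisartest", Key="sarconfig.yaml", Body=CONFIG.encode("utf-8"))
print("Uploaded s3://bshettisartest/sarconfig.yaml")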

Once Amazon CloudFormation finishes setting up Elastic serverless forwarder, you should see two Amazon Lambda functions:

To check whether logs are coming in, go to the function with “ApplicationElasticServer” in the name, open the Monitor tab, and look at the logs. You should see the logs being pulled from S3.

Step 6: Check and ensure you have logs in Elastic

Now that steps 1–5 are complete, you can go to Elastic’s Discover capability, and you should see VPC Flow Logs coming in. In the image below, we’ve filtered by the Amazon S3 bucket bshettisartest.
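
If you’d rather verify programmatically that flow logs are arriving, a quick document count over a recent window does the job; the index pattern is an assumption based on the AWS VPC Flow integration’s data stream naming.

from elasticsearch import Elasticsearch

# Placeholders -- use your own endpoint and API key.
es = Elasticsearch("https://aws-logs.es.us-east-1.aws.found.io", api_key="<your-api-key>")

# Count VPC flow log documents ingested in the last 15 minutes.
resp = es.count(
    index="logs-aws.vpcflow-*",
    query={"range": {"@timestamp": {"gte": "now-15m"}}},
)
print(f"VPC flow log documents in the last 15 minutes: {resp['count']}")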

Conclusion: Elastic Observability easily integrates with VPC Flow Logs for analytics, alerting, and insights

I hope you’ve gained an appreciation for how Elastic Observability can help you manage AWS VPC Flow Logs. Here’s a quick recap of what you learned:

  • A walk-through of how Elastic Observability provides enhanced analysis for VPC Flow Logs:
    • Using Elastic’s Analytics Discover capabilities to manually analyze the data
    • Leveraging Elastic Observability’s anomaly features to:
      • Identify anomalies in the VPC Flow Logs
      • Detect anomalous log entry rates
      • Automatically categorize log messages
    • Using an OOTB dashboard to further analyze data
  • A more detailed walk-through of how to set up the Elastic Serverless Forwarder

Start your own 7-day free trial by signing up via AWS Marketplace and quickly spin up a deployment in minutes on any of the Elastic Cloud regions on AWS around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.

Additional logging resources:

Common use case examples with logs:

]]>