Benjamin Kušen
January 8, 2024

How to Collect Logs with Vector in Kubernetes

Want to know how to collect logs with Vector in Kubernetes? Then keep reading, and grab a pen.

Many businesses have been using Vector to collect logs in Kubernetes for years, and the wider K8s community is gradually recognizing its strengths and the advantages of deploying it. In this article, we'll look at the kinds of logs Kubernetes produces. After that, we'll dive into Vector and its architecture. Lastly, we'll walk through real use cases and practical experience of running Vector.

Kubernetes Logging

The main goal of Kubernetes is to run containers on nodes. These containers are traditionally designed following Heroku's twelve-factor methodology. In the sections below, we will look at how logs are produced.

Logs of Applications (Pods)

Applications running in K8s write their logs to stdout or stderr. The container runtime then collects and stores these logs in its own directory, typically /var/log/pods, although the location can be configured to suit specific needs.

Image of pods in kubelet schema

Logs of Node Services

Some services on Kubernetes nodes, such as kubelet and containerd, run outside containers. These services must not be overlooked, and relevant messages should be collected from syslog (for example, SSH authentication messages). In addition, containers sometimes write logs to specific file paths: kube-apiserver, for instance, commonly writes its audit logs this way. Such logs have to be retrieved from the corresponding nodes.

Events in Log Collection

Events in Kubernetes have a distinct structure and exist only in etcd. To collect them, you have to query the Kubernetes API.

Events can be treated as metrics, thanks to the reason field, which identifies the event, and the count field, which acts as a counter incrementing as the same event recurs. They can also be treated as traces, since the firstTimestamp and lastTimestamp fields make it possible to build Gantt diagrams showing everything that happened in a cluster. Finally, events provide human-readable messages in the message field, so they can simply be collected as logs.

<pre class="codeWrap"><code>apiVersion: v1
kind: Event
count: 1
metadata:
 name: standard-worker-1.178264e1185b006f
 namespace: default
reason: RegisteredNode
firstTimestamp: '2023-09-06T19:08:47Z'
lastTimestamp: '2023-09-06T19:08:47Z'
involvedObject:
 apiVersion: v1
 kind: Node
 name: standard-worker-1
 uid: 50fb55c5-d97e-4851-85c6-187465154db6
message: 'Registered Node standard-worker-1 in Controller'
</code></pre>

To sum up, Kubernetes produces events, node service logs, and pod logs. In this article, we'll focus on node service logs and pod logs, since collecting events requires additional software that scrapes the Kubernetes API, which is beyond our scope.

Understanding Vector

First, let's look at what Vector actually is.

Unique Features of Vector

Vector is a lightweight, fast tool for building observability pipelines. Given its feature set, it can also be described as open-source software for building log-collection pipelines.

Vendor agnosticism is another salient feature of Vector. Although Vector is owned by Datadog, it integrates easily with other vendors' solutions, e.g., Elasticsearch Cloud, Grafana Cloud, and Splunk. This flexibility means a single piece of software can be used across multiple vendors. Another notable point: you don't need to rewrite your Go application in Rust to boost performance, because Vector itself is already written in Rust.

In addition, Vector is highly performant. Its CI system runs benchmarks on every pull request, and the maintainers actively evaluate how new features affect performance. If a regression appears, the contributor is asked to resolve it quickly, since performance is a key requirement for Vector.

The Architecture of Vector

Vector gathers data from multiple sources, either by scraping it or by acting as an HTTP server that other tools push data into. It can transform log entries: altering them, aggregating several messages into one, or dropping them altogether. After the transformation step, Vector forwards the messages to storage or to a queue. In short, the Vector architecture boils down to collecting logs, transforming them, and sending them on.
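To make this concrete, here is a minimal sketch of such a pipeline in Vector's TOML configuration: a source that collects pod logs, a transform that adjusts them, and a sink that ships them to storage. The transform logic and the Elasticsearch endpoint are illustrative assumptions, not values from this article.

<pre class="codeWrap"><code># Collect pod logs from the node
[sources.logs]
type = "kubernetes_logs"

# Example transformation: add a static field to every event
[transforms.normalize]
type = "remap"
inputs = ["logs"]
source = '''
  .cluster = "production"
'''

# Ship the result to storage (the endpoint is an assumed example)
[sinks.storage]
type = "elasticsearch"
inputs = ["normalize"]
endpoints = ["http://elasticsearch.monitoring.svc:9200"]
</code></pre>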

Image of basic Vector architecture schema

Vector incorporates a powerful transformation language known as the Vector Remap Language (VRL), allowing for an infinite number of possible transformations.

Vector Remap Language Cases

Let's consider log filtering to get a feel for VRL. Below, a VRL expression is used in a filter transform to keep only those log entries whose severity field is not equal to "info".

<pre class="codeWrap"><code>[transforms.filter_severity]
type = "filter"
inputs = ["logs"]
condition = '.severity != "info"'
</code></pre>

When Vector collects pod logs, it also enriches the log lines with additional pod metadata, e.g., pod labels, pod name, and pod IP. However, the pod labels may include labels that only Kubernetes controllers use and that carry no value for humans. For better performance, we suggest dropping such labels, as shown in the sketch below.
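As a rough sketch (the label keys below are only examples, and the field path assumes Vector's kubernetes_logs source, which puts pod labels under .kubernetes.pod_labels), such labels can be dropped with a remap transform:

<pre class="codeWrap"><code>[transforms.drop_controller_labels]
type = "remap"
inputs = ["logs"]
source = '''
  # Remove labels that are only meaningful to Kubernetes controllers
  del(.kubernetes.pod_labels."pod-template-hash")
  del(.kubernetes.pod_labels."controller-revision-hash")
  del(.kubernetes.pod_labels."pod-template-generation")
'''
</code></pre>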

<pre class="codeWrap"><code>[transforms.backslash_multiline]
type = "reduce"
inputs = ["logs"]
group_by = ["file", "stream"
]merge_strategies."message" = "concat_newline"
ends_when = '''
 matched, err = match(.message, r'[^\]$');
 if err != null {
   false;
 } else {
  matched;
 }
'''
</code></pre>

Now let's look at a case showing how multiple log lines can be concatenated into a single one:

<pre class="codeWrap"><code>[transforms.backslash_multiline]
type = "reduce"
inputs = ["logs"]
group_by = ["file", "stream"]
merge_strategies."message" = "concat_newline"
ends_when = '''
 matched, err = match(.message, r'[^\]$');
 if err != null {
   false;
 } else {
   matched;
 }
'''</code></pre>

In this example, the merge_strategies field joins messages with a newline character in the message field, while the ends_when section uses a VRL expression to detect when a line no longer ends with a backslash (the way multi-line Bash commands are continued), which closes the merged event.

The Topologies of Log Collecting

Now let's go over some of the log-collection topologies that can be used with Vector.

Distributed Topology

In this topology, Vector agents are installed on every node in the Kubernetes cluster. The agents gather and transform the logs, then send them directly to storage.

Centralized Topology

In this topology, Vector agents also run on every node, but they don't perform heavy transformations; dedicated aggregators handle those. The appeal of this setup is that it copes well with heavy loads: you can assign dedicated nodes to the aggregators and scale them as needed, which keeps Vector's resource consumption on the cluster nodes in check.
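As an illustration of the agent-to-aggregator link (a sketch only; the addresses are assumptions, and the two snippets would normally live in separate agent and aggregator configs), the agent can forward events over Vector's native protocol, and the aggregator can receive them with the matching source:

<pre class="codeWrap"><code># On every node: a lightweight agent that forwards events to the aggregator
[sinks.to_aggregator]
type = "vector"
inputs = ["logs"]
address = "vector-aggregator.monitoring.svc:6000"

# On the aggregator: receive events from the agents
[sources.from_agents]
type = "vector"
address = "0.0.0.0:6000"
</code></pre>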

Stream-Based Approach

With this approach, Kubernetes pods get rid of their logs as quickly as possible, and the logs are pushed straight into storage. If that storage is Elasticsearch, it has to analyze every log line and update its indexes accordingly, which can be resource-intensive. If it's Kafka instead, messages are treated as mere strings, so the logs can later be conveniently retrieved from Kafka for analysis and long-term storage (a minimal sink sketch follows below). We won't cover the aggregator-side topologies further; in this article we only consider Vector's role as a log-collection agent on cluster nodes.
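Just to make the stream-based idea concrete, an agent that ships raw log lines into Kafka might look roughly like this (the broker address and topic name are assumptions):

<pre class="codeWrap"><code>[sinks.kafka]
type = "kafka"
inputs = ["logs"]
bootstrap_servers = "kafka-0.kafka.svc:9092"
topic = "k8s-logs"
# Messages are written as plain JSON strings; parsing and indexing happen downstream
encoding.codec = "json"
</code></pre>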

Vector and Kubernetes

What does Vector look like in Kubernetes? Let us show you.

Image of Vector in Kubernetes schema

At first glance, the design may look somewhat complex, but there are good reasons behind it.

There are three containers in that pod:

  • The first container runs Vector itself. Its primary job is to collect logs.
  • The second container is the Reloader. It lets platform users define their own log-collection and ingestion pipelines: a dedicated operator takes those user-defined rules and renders a ConfigMap for Vector. The Reloader watches this ConfigMap and reloads Vector whenever it changes.
  • The third container is kube-rbac-proxy. It matters because Vector exposes a number of metrics about the collected log lines; this information can be sensitive, so it must be protected with proper authorization.

Vector is deployed as a DaemonSet because its agents have to run on every node of the Kubernetes cluster.

To collect logs efficiently, a few additional directories have to be mounted into Vector:

  • The /var/log directory, because all pod logs end up there.
  • A persistent volume for storing checkpoints. Every time Vector ships a log line, it records a checkpoint so it doesn't send duplicate logs to the same storage.
  • /etc/localtime, so Vector knows the node's time zone.

<pre class="codeWrap"><code>apiVersion: apps/v1
kind: DaemonSet
volumes:
- name: var-log
 hostPath:
   path: /var/log/
- name: vector-data-dir
 hostPath:
   path: /mnt/vector-data
- name: localtime
 hostPath:
   path: /etc/localtime
volumeMounts:
- name: var-log
 mountPath: /var/log/
  readOnly: true
terminationGracePeriodSeconds: 120
shareProcessNamespace: true
</code></pre>

A few more things to note:

  • When mounting the /var/log directory, be sure to enable readOnly mode. This is a safety measure that prevents unintended modification of files on the node.
  • A termination grace period ensures that Vector finishes its in-flight work before it restarts.
  • The process namespace has to be shared with the Reloader so that it can send a signal to Vector and trigger a restart.

The next section is all about use cases, and these are not hypothetical scenarios.

Use Cases

Disk Space Issue

One day, all of the pods were evicted from a node because it ran out of disk space. The investigation revealed that Vector was keeping deleted files around. How could that happen?

  • Vector discovers the files in the /var/log/pods directory.
  • As the application keeps writing logs, the file grows past the 10-megabyte limit: 40, 50, 60… megabytes.
  • At some point, kubelet rotates the log file, shrinking it back to its original 10 megabytes.
  • Meanwhile, Vector is trying to ship the logs to Loki, and Loki cannot keep up with that volume of data.
  • Vector, being a conscientious piece of software, still aims to deliver every single log line to storage.

Applications write quickly and don't wait for anything to finish; they just keep going. Vector, in turn, holds on to all the log files, including rotated and deleted ones, so that nothing is lost; kubelet keeps rotating them, and the free disk space on the node keeps shrinking.

There are several ways to address this problem:

  • Start by adjusting the buffer settings. By default, the buffer capacity is limited to a mere 1,000 messages, which is quite low; it can be raised to around 10,000.
  • Changing the buffer behavior from blocking to dropping the newest messages can also help: with the drop-newest behavior, Vector simply discards any logs it cannot fit into the buffer.
  • You can also use a disk buffer instead of a memory buffer. The drawback is the extra input-output overhead, so weigh your performance requirements when deciding whether this approach suits you. A sketch of these buffer settings follows below.
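Here is a sketch of what such buffer tuning could look like on a Loki sink (the endpoint and label values are assumptions, and exact defaults may differ between Vector versions):

<pre class="codeWrap"><code>[sinks.loki]
type = "loki"
inputs = ["logs"]
endpoint = "http://loki.monitoring.svc:3100"
encoding.codec = "json"
labels.job = "vector"

[sinks.loki.buffer]
# "memory" is the default; a "disk" buffer trades speed for durability
# (and uses max_size in bytes instead of max_events)
type = "memory"
max_events = 10000
when_full = "drop_newest"
</code></pre>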

You can also sidestep the problem by adopting a stream-based topology. Letting logs leave the node as soon as possible reduces the risk of disrupting production applications: a monitoring problem should never take down the production cluster. Lastly, you can use sysctl to raise the maximum number of open files allowed for a process, although this approach should be used sparingly.

The Prometheus Case

Vector runs on a node and performs several tasks: it collects logs and also exposes metrics about itself, e.g., the number of log lines collected and the number of errors that occurred. This is possible thanks to Vector's excellent observability. However, many of these metrics carry a file label, which can produce cardinality so high that Prometheus cannot absorb it.

When a pod is restarted, Vector starts exposing metrics for the new pod while still keeping the metrics for the old one, each with its own unique file label. This is how the Prometheus exporter is designed to behave. After many pod restarts, the result is a sudden surge in Prometheus load that eventually ends in an "explosion."

To resolve the problem, you can apply a metric relabeling rule that strips the troublesome file label. That fixed things for Prometheus, which went back to operating normally. After a while, though, Vector ran into another problem: it was using more and more memory to store all those metrics, effectively a memory leak. To address it, a global Vector option called expire_metrics_secs was applied: if you set it to 60 seconds, for example, Vector checks whether data is still being collected from those pods and, if not, stops exporting the corresponding metrics. Although this solution worked, it had a side effect on the Vector component error metric: as the graph below shows, 7 errors were recorded early on, after which the metric simply stopped being exposed.
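For reference, the relabeling part might look roughly like this on the Prometheus side (a sketch; the job name and target address are assumptions):

<pre class="codeWrap"><code>scrape_configs:
- job_name: vector
  static_configs:
  - targets: ["vector.monitoring.svc:9598"]  # assumed address of Vector's Prometheus exporter
  metric_relabel_configs:
  # Drop the high-cardinality "file" label from every scraped series
  - action: labeldrop
    regex: file
</code></pre>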

Image of illustration of the Vector error metric in Prometheus

Prometheus, and PromQL functions such as rate() in particular, cannot handle such large gaps in the data; they expect the metric to be exposed over the entire time span. To solve this, the Vector code was modified to eliminate the file label entirely, which only required removing the "file" entry in a couple of places. That approach solved the problem.

Kubernetes Control Plane Outage

One day, we observed that the Kubernetes control plane failed whenever the Vector instances were restarted simultaneously. Analyzing the dashboards showed that the problem stemmed from a spike in memory usage, primarily etcd memory consumption.

Image of screenshot of the Kubernetes control plane outage

In order to find the root cause of the problem, the team dived into the internal operations of the Kubernetes API.

When a Vector instance starts, it issues a LIST request to the Kubernetes API to populate its cache with pod metadata, which Vector needs to enrich log entries. So every Vector instance was asking the Kubernetes API for the metadata of the pods running on its own node.

It's important to know that etcd is a key-value database whose keys are composed of the resource's name, namespace, and kind. For every request covering, say, 110 pods on a node, the Kubernetes API goes to etcd and fetches all the pod data. That drives up memory usage in both kube-apiserver and etcd until they fail.

There are two possible solutions to this problem. The first is the cache-read approach: you instruct the API server to read from its own cache instead of etcd. Inconsistencies may occur under certain conditions, but that is acceptable for this kind of tooling. This capability was missing from the Kubernetes Rust client.

Hence, we submitted a pull request to Vector that enabled the use_apiserver_cache=true option. The second option is to rely on the Kubernetes API Priority and Fairness features. You can define a request queue:

<pre class="codeWrap"><code>apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: PriorityLevelConfiguration
metadata:
 name: limit-list-custom
spec:
 type: Limited
 limited:
   assuredConcurrencyShares: 5
   limitResponse:
     queuing:
       handSize: 4
       queueLengthLimit: 50
       queues: 16
     type: Queue
</code></pre>

…and link it to a specific service account:

<pre class="codeWrap"><code>apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
 name: limit-list-custom
spec:
 priorityLevelConfiguration:
   name: limit-list-custom
 distinguisherMethod:
   type: ByUser
 rules:
 - resourceRules:
    - apiGroups: [""]
     clusterScope: true
     namespaces: ["*"]
     resources: ["pods"]
     verbs: ["list", "get"]
   subjects:
   - kind: ServiceAccount
     serviceAccount:
       name: ***
       namespace: ***
</code></pre>

This configuration limits the number of concurrent in-flight requests, which smooths out the spikes and reduces their impact.

Lastly, instead of relying on the Kubernetes API, you can use the kubelet API to obtain pod metadata by sending requests to its /pods endpoint. However, this feature has not yet been implemented in Vector.

Summing Up

Vector is a wonderful tool for engineering teams. It offers a great deal of flexibility, breadth of log ingestion, and delivery options. You can reap the full benefits of its robust feature set.
