
pSSID Data Analytics Pipeline

A data analytics pipeline for pSSID that receives, stores, and visualizes WiFi test metrics gathered by Raspberry Pi WiFi probes.

[Figures: pSSID architecture with the data pipeline highlighted (left); pipeline architecture (right)]

The picture on the left gives an overview of the entire pSSID architecture, with the role of this data analytics pipeline highlighted. In short, the pipeline receives test results (metrics) gathered by the probes, then stores and visualizes them.

The picture on the right shows the architecture of the pipeline itself. It follows the idea of the ELK stack, simply replacing Elasticsearch and Kibana with Opensearch and Grafana, respectively.

Requirements

The setup of the pipeline assumes that you have a virtual machine running Ubuntu 22 with Docker installed. If Docker is not installed, you can install it with

sudo apt update && sudo apt install docker.io docker-compose -y
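You can verify both installations afterwards:

docker --version && docker-compose --version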

Installation

  1. Clone this repository to the machine you would like to host the pipeline on. Each service has its own docker-compose file for better modularization: if demand changes, say you need more Opensearch nodes, you can simply provision more nodes without touching the other components of the pipeline.

  2. Set passwords for Opensearch, which is required since version 2.12.0. The easiest way to do so is with environment variables. Add the following lines to your .bashrc file. This documentation uses admin as the username and OpensearchInit2024 as the password for demonstration.

export OPENSEARCH_INITIAL_ADMIN_PASSWORD=OpensearchInit2024
export OPENSEARCH_USER=admin
export OPENSEARCH_PASSWORD=OpensearchInit2024

⚠️ These variables are consumed by opensearch-one-node.yml and logstash.yml, so do not change the variable names unless you have a good reason. Their values, however, can be changed freely.
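For reference, the compose files forward them roughly like this (a simplified sketch; see opensearch-one-node.yml in this repository for the actual service definition):

services:
  opensearch-node1:
    image: opensearchproject/opensearch:latest
    environment:
      # docker-compose substitutes ${...} from the shell environment at startup
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD}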

Don't forget to run

source ~/.bashrc

to load the environment variables.

⚠️⚠️ Note that this approach with environment variables requires that you do not run docker-compose with sudo, since the root user cannot read environment variables defined by non-root users. Make sure the current user is in the docker group so that you can run docker-compose directly without sudo. Add yourself to the docker group and activate the membership by running the following command.

sudo usermod -aG docker ${USER} && newgrp docker
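To confirm the group change took effect, run any Docker command without sudo, for example:

docker ps

If it prints a (possibly empty) container list rather than a permission error, you are set.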

⚠️⚠️ Opensearch requires vm.max_map_count to be at least 262144. Check your current value by running

sysctl vm.max_map_count

and if it is too low (the default is 65530 on some machines, for example), edit the /etc/sysctl.conf file and add the following line

vm.max_map_count=262144

Apply the change

sudo sysctl -p
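If you only want to test the value at runtime without editing /etc/sysctl.conf (the change is then lost on reboot), you can instead run:

sudo sysctl -w vm.max_map_count=262144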

  3. Configure Logstash. Create a directory on the host machine, say logstash-pipeline, with at least a logstash.conf file in it. logstash.conf contains the input and output sources, as well as any custom filters you would like to implement. A sample file is provided inside the logstash-pipeline directory of this repository; you can use it as your pipeline directory and add more .conf files to it.

Open logstash.yml and edit the following TODO item.

Mount the directory you just created to the pipeline directory inside the container.

# TODO: mount your pipeline directory into the container. USE ABSOLUTE PATH!
- <ABS_PATH_TO_YOUR_PIPELINE_DIRECTORY>:/usr/share/logstash/pipeline
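For illustration, if the repository were cloned to /home/ubuntu (a hypothetical path) and you used the provided logstash-pipeline directory, the finished line would read:

- /home/ubuntu/pssid-data-pipeline/logstash-pipeline:/usr/share/logstash/pipeline  # hypothetical path, substitute your own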
  4. No configuration is required for Grafana.

  5. Start the three components of the service with docker-compose.

docker-compose -f <path-to-opensearch.yml> up -d
docker-compose -f <path-to-logstash.yml> up -d
docker-compose -f <path-to-grafana.yml> up -d

OPTIONAL: you can also start the Opensearch dashboard in the same way.

docker-compose -f <path-to-opensearch-dashboard.yml> up -d

By default, Logstash listens for Filebeat input on port 9400, Opensearch listens for Logstash input on port 9200, the Grafana dashboard is hosted on port 3000, and the optional Opensearch dashboard is hosted on port 5601. Make sure the firewall settings allow external traffic to ports 9400, 3000, and 5601.
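On Ubuntu with ufw, for example, the rules could look like the following sketch; adapt it to whatever firewall you use:

sudo ufw allow 9400/tcp   # Filebeat -> Logstash
sudo ufw allow 3000/tcp   # Grafana dashboard
sudo ufw allow 5601/tcp   # optional Opensearch dashboard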

Usage

logstash.conf

This file contains the input source, custom filters, and output destination. See the sample file for more details. The input and output fields generally require minimal changes, if any; most of the customization happens in the filter field. You can implement as many filters as you like, and more thorough filtering at the Logstash level usually results in simpler configuration at the Grafana level.

The sample file contains a single pipeline with multiple filters applied. Refer to the official documentation for more advanced examples with multiple pipelines.
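For orientation, a minimal single-pipeline logstash.conf in the spirit of the sample file might look like the sketch below. It assumes the beats input plugin on port 9400 and the opensearch output plugin; the pssid-%{+YYYY.MM.dd} index name and the mutate filter are purely illustrative.

input {
  beats {
    port => 9400                            # Filebeat connects here
  }
}

filter {
  mutate {
    add_tag => ["pssid"]                    # example filter: tag events for easier querying
  }
}

output {
  opensearch {
    hosts => ["https://opensearch-node1:9200"]
    user => "${OPENSEARCH_USER}"            # resolved from the environment
    password => "${OPENSEARCH_PASSWORD}"
    index => "pssid-%{+YYYY.MM.dd}"         # illustrative index name
    ssl_certificate_verification => false   # self-signed certs, as with 'Skip TLS Verify' in Grafana
  }
}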

Filebeat

On each WiFi probe, install Filebeat. Refer to the official Filebeat documentation.

Then open the configuration file /etc/filebeat/filebeat.yml and edit the following fields.

Specify the input source for Filebeat, which is the output destination of pSSID. In the following example, test results gathered by pSSID are written to /var/log/pssid.log on the probe.

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/pssid.log

Comment out the output section for Elasticsearch and uncomment the one for Logstash.

output.logstash:
  hosts: ["<pipeline-hostname>:9400"]
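After editing, you can validate the configuration and connectivity, then (on systemd-based probes) start the service:

sudo filebeat test config     # checks filebeat.yml syntax
sudo filebeat test output     # checks the connection to Logstash on port 9400
sudo systemctl enable --now filebeat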

Grafana

Navigate to the Grafana dashboard at <pipeline-hostname>:3000. By default, the Grafana username and password are both admin. To add a data source, select Opensearch from the list of available sources and configure it as follows.

[Figure: adding the Opensearch data source]

Remarks:

  • URL: use https instead of http, and check Basic auth and Skip TLS Verify under the Auth section. User and Password under Basic Auth Details are the OPENSEARCH_USER and OPENSEARCH_PASSWORD defined earlier, which are admin and OpensearchInit2024 in our example. Also make sure to use the Docker-aliased hostname opensearch-node1 instead of the actual hostname of your pipeline machine.
  • Index name: wildcard patterns are allowed here. To see the list of all Opensearch indices, run
curl -u <OPENSEARCH_USER>:<OPENSEARCH_PASSWORD> --insecure \
    "https://localhost:9200/_cat/indices?v"

on the pipeline machine.

  • Click on Get Version and Save, which should automatically populate the Version and Max concurrent Shard Requests fields, indicating a successful configuration.

Having configured the data source, you can now create visualization panels and dashboards.

[Figure: example visualization panel]
