Updated: Dec 16, 2021
Telegraf, InfluxDB and Grafana, together known as the TIG stack, are a set of open source tools for metric collection, storage and visualization. One common use case of the TIG stack is VMware infrastructure monitoring.
Telegraf, a metrics collection tool, has an input plugin dedicated to collecting information from a configured vCenter server. The metrics exposed through the vCenter SDK (the ones visible to us in vCenter), such as host, VM and vSAN performance metrics, are all available through the Telegraf vSphere input plugin. For more information, refer to: "VMware vSphere Telegraf Input Plugin | InfluxData".
InfluxDB is a time-series database that stores data for the period configured in the database's retention policy. This gives us a historical view of our datacenter metrics. Telegraf has an output plugin that sends the metrics it collects to an InfluxDB database for storage. For more information, refer to: "InfluxDB: Open Source Time Series Database | InfluxData".
Grafana is an open source visualization tool that queries supported data sources and displays the results in a graphical user interface. Grafana can query InfluxDB databases, which allows us to visualize the stored time-series data. It is a user-friendly tool with a wide range of visualization plugins, so end users can build creative infrastructure monitoring dashboards. For a pre-built TIG stack dashboard, refer to: "VMware vSphere - Overview dashboard for Grafana | Grafana Labs".
The TIG stack can be set up entirely on an on-premise VM. But what if you want to view your infrastructure metrics when you are away from your on-premise datacenter, for example when travelling or working remotely? You can of course configure a VPN and remotely access your TIG stack VM. However, configuring and maintaining this is quite complicated and not very cost effective. In addition, for homelabs or SMEs where on-premise internet access comes through an ISP-provided router, most ISPs do not allow you to open TCP ports on that router, so a VPN is out of the question.
Further, in the event of an on-prem infrastructure failure or disaster, you would need the historical metrics for investigation and troubleshooting. If your monitoring VM is hosted on the same on-prem servers, you may lose access to this critical information.
The real potential of the TIG stack lies in storing historical information and using it for troubleshooting, and that requires storage. With an on-prem deployment, you would need to provision that storage in your own network/infrastructure. These are my primary reasons for selecting the architecture discussed in this post for on-prem metrics monitoring. If you are working in a hybrid cloud environment, this solution also provides centralized metrics visualization for on-prem as well as cloud infrastructure.
Refer to the basic architecture schematic shown in the diagram below:
Key components of the solution:
On-Premise gateway VM [AWSGW]: This is a Windows Server 2022 VM with the "Remote Access" feature configured. It provides a NAT interface for connecting the homelab network to the internet. All incoming traffic is blocked; all outgoing traffic and its responses are allowed through. The AWSGW VM has 2 NICs, one connected to the internal network "VMNW" and the other connected to the ISP router "EXTNW". Telegraf is installed on this VM and configured to collect metrics from the on-prem vCenter and send them to InfluxDB installed on the AWS Windows Server 2022 EC2 instance.
AWS EC2 Instance: This is a Windows Server 2022 EC2 "t2.micro" instance. InfluxDB and Grafana are installed on this instance. The on-prem Telegraf sends the collected metrics to InfluxDB on this EC2 instance via the Elastic IP associated with the instance. The Grafana web UI can be reached over the internet, from any source IP allowed by the instance's security group, using the same Elastic IP; this is what allows us to visualize the on-prem metrics remotely.
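To make the Telegraf-to-InfluxDB hop a little more concrete: Telegraf's InfluxDB output sends each batch as an HTTP POST to the `/write` endpoint of InfluxDB 1.x, with the body in InfluxDB line protocol. The sketch below only illustrates what such a request looks like; Telegraf does all of this itself, so nothing here needs to be run. The measurement, tag and field names, the IP address (taken from a documentation range) and the database name are assumptions made for the example.

```python
from urllib.parse import urlencode


def escape_key(text: str) -> str:
    """Escape commas, equals signs and spaces, which are special in tags and field keys."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def format_field(value) -> str:
    """Render a field value the way line protocol expects it."""
    if isinstance(value, bool):  # bool must be checked before int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line-protocol record: measurement,tags fields timestamp."""
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    tag_part = "".join(
        f",{escape_key(k)}={escape_key(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape_key(k)}={format_field(v)}" for k, v in fields.items()
    )
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


def write_url(host: str, database: str, port: int = 8086) -> str:
    """URL of the InfluxDB 1.x write endpoint for the given database."""
    return f"http://{host}:{port}/write?" + urlencode({"db": database})


# Hypothetical sample point and endpoint, for illustration only.
print(write_url("203.0.113.10", "vmware"))
print(to_line(
    "vsphere_host_cpu",
    {"vcenter": "vc01", "esxhostname": "esx 01"},
    {"usage_average": 12.5, "ready_summation": 3},
    1639612800000000000,
))
```

This also shows why only outbound access is needed on-prem: the gateway VM initiates the connection to port 8086 on the Elastic IP, and nothing has to connect back into the homelab.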
Now that we have briefly looked into the architecture, let us look into the details of configuring all the components of this solution. For simplicity, I will break down the configuration into three major parts:
1- AWS configurations
2- Install Grafana and InfluxDB on the EC2 instance
3- On-premise configurations
1. AWS Configurations:
- Log in to the AWS management console and navigate to the VPC management console
- Click on "Internet gateway" and create a custom gateway as shown in the snip below:
- Click on "Your VPCs" and create a custom VPC with configurations similar to that shown in the snip below, set your desired CIDR for the VPC:
- Once the VPC is created go back to "Internet Gateway" and select the internet gateway you created. Attach this Internet gateway to the VPC created in the previous step.
- Create a subnet and attach it to the VPC you created. Ensure the subnet's CIDR block falls within the CIDR block of the VPC configured in the previous step:
- Create a route table and attach it to the VPC previously created. Add the subnet previously created to this route table's explicit subnet associations. In the route table, add a route with destination 0.0.0.0/0 (all IPs) and the internet gateway as its target, as shown below. This makes the subnet a public subnet, as it is routed directly to the internet gateway:
- Now that our AWS network configuration is complete, we can go ahead and create the Windows EC2 instance. This can also be set up on a Linux instance, but since this was a test setup, I used a Windows instance to keep things clear for myself.
- From AWS console, go to EC2 console.
- Choose "Launch Instance". Select the "Windows Server 2022 Base" AMI, as used in this example.
- Select the "t2.micro" instance and click on "Configure Instance Details"
- In the "Network" drop down, select the VPC you created, in the "subnet" drop-down, select the subnet you created.
- Click on "Add storage"
- Select a size based on the number of days you want to retain the InfluxDB metrics data on disk. I selected the default 30 GB as this was a test.
- Click on "Add Tags" and leave the defaults. Click on "Configure Security Group". The RDP rule is present by default; under the source drop-down, select "My IP". Add two more rules. The first is for port 8086, the TCP port on which InfluxDB listens for data sent by a collector; under the source drop-down, select "My IP". The second is for port 3000, the port on which the Grafana web UI is accessible. In my example I selected "My IP" as the source for this as well; however, you can add more IPs based on the locations from which you want to access the Grafana UI. I would advise against opening these ports to 0.0.0.0/0, i.e. all sources, as this makes the solution less secure:
- Now, click on "Review and launch". Wait for the instance state to reach "running".
- Now go back to your VPC console and click on "Elastic IP". Click on "Allocate Elastic IP" and create an Elastic IP.
- Click on the Elastic IP you just created, select Actions -> Associate Elastic IP address, select the Windows instance you created and then click on Associate.
- Your Instance should be similar to the snip below:
- Now click on "Connect" and obtain the RDP password.
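Two of the steps above are easy to get wrong and can be sanity-checked offline before touching the console: the subnet CIDR must sit inside the VPC CIDR, and none of the inbound security group rules should have 0.0.0.0/0 as their source. The following is a minimal sketch using Python's standard `ipaddress` module; the CIDR blocks and the "My IP" address are hypothetical placeholders, not values from my setup.

```python
import ipaddress


def subnet_fits_vpc(vpc_cidr: str, subnet_cidr: str) -> bool:
    """True when the subnet range lies entirely inside the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnet = ipaddress.ip_network(subnet_cidr)
    return subnet.subnet_of(vpc)


def world_open_ports(rules) -> list:
    """Return the ports of inbound rules whose source is 0.0.0.0/0."""
    anywhere = ipaddress.ip_network("0.0.0.0/0")
    return [port for port, source in rules
            if ipaddress.ip_network(source) == anywhere]


# Hypothetical values; substitute the CIDRs and public IP from your own setup.
VPC_CIDR = "10.20.0.0/16"
SUBNET_CIDR = "10.20.1.0/24"
MY_IP = "198.51.100.7/32"

INBOUND_RULES = [
    (3389, MY_IP),  # RDP
    (8086, MY_IP),  # InfluxDB, written to by the on-prem Telegraf
    (3000, MY_IP),  # Grafana web UI
]

print("subnet inside VPC:", subnet_fits_vpc(VPC_CIDR, SUBNET_CIDR))
print("ports open to the world:", world_open_ports(INBOUND_RULES))
```

An empty list from `world_open_ports` means every rule is restricted to specific source addresses, which is the configuration recommended above.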
2. Install Grafana and InfluxDB on the EC2 instance
- Connect to your EC2 instance via RDP using the Elastic IP address and the credentials obtained.
- To download and install Grafana, open this link from the EC2 instance: "Download Grafana | Grafana Labs" and click on "Download the installer". Once the MSI file is downloaded, simply run it; the Grafana server is installed as a service on the instance.
- To download and install InfluxDB, open a PowerShell console in administrator mode and run the commands below:
wget https://dl.influxdata.com/influxdb/releases/influxdb-1.8.10_windows_amd64.zip -UseBasicParsing -OutFile influxdb-1.8.10_windows_amd64.zip
Expand-Archive .\influxdb-1.8.10_windows_amd64.zip -DestinationPath 'C:\Program Files\InfluxData\influxdb\'
- To run the InfluxDB server, run the following commands in the same PowerShell window once the download completes. Keep this PowerShell window open; closing it stops the InfluxDB server.
cd 'c:\Program Files\InfluxData\influxdb'
.\influxd.exe
- Open Windows Firewall and allow inbound TCP ports 8086 (for influxd.exe only) and 3000 (for Grafana only).
- This completes the setup of InfluxDB and Grafana on your EC2 instance.
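Before moving on to the on-prem side, it can be worth confirming that both services answer over HTTP. InfluxDB 1.x exposes a `/ping` endpoint that returns status 204, and Grafana exposes `/api/health`, which returns 200 when the server is up. The sketch below is one possible check using only the Python standard library; the helper names are my own, and the default ports are the ones used in this post.

```python
import urllib.error
import urllib.request


def influx_ping_url(host: str, port: int = 8086) -> str:
    """URL of the InfluxDB 1.x ping endpoint."""
    return f"http://{host}:{port}/ping"


def grafana_health_url(host: str, port: int = 3000) -> str:
    """URL of the Grafana health endpoint."""
    return f"http://{host}:{port}/api/health"


def http_status(url: str, timeout: float = 5.0):
    """Return the HTTP status code of a GET request, or None if it fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timed out, DNS failure, ...


def services_up(host: str, fetch=http_status) -> dict:
    """Report whether InfluxDB and Grafana respond as expected on the host."""
    return {
        "influxdb": fetch(influx_ping_url(host)) == 204,
        "grafana": fetch(grafana_health_url(host)) == 200,
    }
```

Calling `services_up("localhost")` on the EC2 instance checks the services themselves. Calling it with the Elastic IP from an address allowed in the security group additionally exercises the security group and Windows Firewall rules.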
3. On-premise configurations:
- The AWSGW VM is where our Telegraf service will run. From a network point of view, the essential configuration is setting up the "Remote Access" feature on this VM. This is discussed in detail in my blog post: "Windows Sever VM as a Homelab to Internet router!! (virtualmystery.info)". In this section, we will take a closer look at the Telegraf collector configuration. If the VM is not already connected to the internet, refer to the above post before proceeding to the next steps.
- RDP to your on-prem gateway VM, in this example AWSGW. Download the telegraf tool using the following commands:
wget https://dl.influxdata.com/telegraf/releases/telegraf-1.20.4_windows_amd64.zip -UseBasicParsing -OutFile telegraf-1.20.4_windows_amd64.zip
Expand-Archive .\telegraf-1.20.4_windows_amd64.zip -DestinationPath 'C:\Program Files\InfluxData\telegraf'
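- Navigate to "C:\Program Files\InfluxData\telegraf". Right click on telegraf.conf and open it with your favorite text editor, e.g. Notepad++. Uncomment and edit the following lines: urls and database under [[outputs.influxdb]], and the [[inputs.vsphere]] header together with its vcenters, username and password lines: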
# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
  ## The full HTTP or UDP URL for your InfluxDB instance.
  ##
  ## Multiple URLs can be specified for a single cluster, only ONE of the
  ## urls will be written to each interval.
  # urls = ["unix:///var/run/influxdb.sock"]
  # urls = ["udp://127.0.0.1:8089"]
  urls = ["http://<Your EC2 instance public Elastic IP>:8086"]

  ## The target database for metrics; will be created as needed.
  ## For UDP url endpoint database needs to be configured on server side.
  database = "<Name for your metrics DB>"
# # Read metrics from VMware vCenter
[[inputs.vsphere]]
#   ## List of vCenter URLs to be monitored. These three lines must be uncommented
#   ## and edited for the plugin to work.
  vcenters = [ "https://<On-prem vcenter IP>/sdk" ]
  username = "<vcenter sso user>"
  password = "<SSO Password>"
As you scroll down the conf file, you will find all the vCenter metrics available in Telegraf. Uncomment the metrics you want to visualize, and then uncomment the insecure_skip_verify line below and set it to true:
#   ## Optional SSL Config
#   # ssl_ca = "/path/to/cafile"
#   # ssl_cert = "/path/to/certfile"
#   # ssl_key = "/path/to/keyfile"
#   ## Use SSL but skip chain & host verification
  insecure_skip_verify = true
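Since telegraf.conf is a long file and the edits are spread across it, it is easy to leave a line commented out or a placeholder unfilled. The following is a rough, line-based sketch (not a full TOML parser) that reports which of the settings edited above are still missing. The function names are my own, and the list of required keys simply mirrors the lines changed in this post; values that span several lines are only checked on their first line.

```python
import re

# Settings this post asks you to uncomment and edit, grouped by plugin section.
REQUIRED = {
    "outputs.influxdb": ["urls", "database"],
    "inputs.vsphere": ["vcenters", "username", "password", "insecure_skip_verify"],
}

SECTION = re.compile(r"^\s*\[+([^\]]+)\]+")            # [agent] or [[inputs.x]]
SETTING = re.compile(r"^\s*([A-Za-z0-9_]+)\s*=\s*(.+)$")


def active_settings(conf_text: str) -> dict:
    """Map each uncommented section to its uncommented key/value pairs."""
    sections = {}
    current = None
    for line in conf_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # commented-out lines are ignored entirely
        header = SECTION.match(line)
        if header:
            current = header.group(1).strip()
            sections.setdefault(current, {})
            continue
        setting = SETTING.match(line)
        if setting and current is not None:
            sections[current][setting.group(1)] = setting.group(2).strip()
    return sections


def problems(conf_text: str) -> list:
    """List required settings that are missing or still hold a <placeholder>."""
    found = active_settings(conf_text)
    issues = []
    for section, keys in REQUIRED.items():
        if section not in found:
            issues.append(f"[[{section}]] is missing or commented out")
            continue
        for key in keys:
            value = found[section].get(key)
            if value is None:
                issues.append(f"{section}: {key} is not set")
            elif "<" in value and ">" in value:
                issues.append(f"{section}: {key} still contains a placeholder")
    return issues
```

To use it, read the file and print the result, for example `print(problems(open(path, encoding="utf-8").read()))` with `path` pointing at your telegraf.conf. An empty list means all the settings touched in this section are active.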
- We are done with the telegraf collector conf file. Save this file and close the text editor window.
- Open a PowerShell Console in administrator mode and type in the below commands:
C:\"Program Files"\InfluxData\telegraf\telegraf.exe --service install --config "C:\Program Files\InfluxData\telegraf\telegraf.conf"
net start telegraf
- The Telegraf service will start collecting information from the configured vCenter server and send it to the InfluxDB database.
- RDP back to your EC2 instance. The InfluxDB server window should look similar to the snip below, showing the database you configured in Telegraf receiving writes from your router's public IP address:
- Now, from any device in your home network, browse to http://<Elastic IP of your instance>:3000. You should reach the Grafana login page. Log in with the default credentials and change the password when prompted.
- On the Grafana home page, go to data sources, select InfluxDB and configure your local (EC2 instance) InfluxDB database as the Grafana data source, as shown in the snip below:
- Save the configuration and go to Explore. You should be able to see your homelab metrics in the data selection options:
You can then configure Grafana's vCenter dashboard to visualize your lab/on-prem metrics.
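If the dashboard stays empty, one way to narrow things down is to ask InfluxDB directly which measurements it has received, bypassing Grafana. InfluxDB 1.x answers InfluxQL statements such as SHOW MEASUREMENTS on its `/query` endpoint and returns JSON. The sketch below is an assumption-laden example: the helper names are mine, and the host and database name are whatever you configured earlier.

```python
import json
import urllib.request
from urllib.parse import urlencode


def query_url(host: str, database: str, influxql: str, port: int = 8086) -> str:
    """URL of an InfluxDB 1.x /query request for the given InfluxQL statement."""
    params = urlencode({"db": database, "q": influxql})
    return f"http://{host}:{port}/query?{params}"


def first_column(payload: dict) -> list:
    """Collect the first column of every series in a /query JSON response."""
    names = []
    for result in payload.get("results", []):
        for series in result.get("series", []):
            names.extend(row[0] for row in series.get("values", []))
    return names


def list_measurements(host: str, database: str, timeout: float = 10.0) -> list:
    """Return the measurement names stored in the database."""
    url = query_url(host, database, "SHOW MEASUREMENTS")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return first_column(json.load(response))
```

Running `list_measurements("localhost", "<Name for your metrics DB>")` on the EC2 instance should list measurements whose names start with vsphere_ once Telegraf has written its first batch. An empty list points at the Telegraf side or the port 8086 rules rather than at Grafana.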