diff --git a/docs/source/ops/generating-keys.rst b/docs/source/ops/generating-keys.rst index afe2ece4009f761aea56acd24fcbf627b985cadb..47a1f4e91a876ac1919252c099654886f0bd128a 100644 --- a/docs/source/ops/generating-keys.rst +++ b/docs/source/ops/generating-keys.rst @@ -1,7 +1,28 @@ Generating keys =============== -``config.json`` has the paths for the Ristretto and the Stripe secret key files. +There's an example ``secrets`` repo in ``morph/grid/local/secrets``. +``<grid>/config.json`` has the paths for the key files for the respective grid. +Create a symlink named ``secrets`` to your secret key repository for the deployment you are working on. + + +Stripe +`````` + +For the Stripe key, any random bytes with a little light formatting "work" (at least enough to make our software happy), but if you want to interact with Stripe and have payments (even pretend payments) move all the way through the system, you should get a Stripe account and generate a key with them. +Lauri can get you added to our "dev" Stripe account, too, though I forget how important that is for ad hoc dev/testing. + +I think this will work for generating random Stripe secret keys (that our software will load, I think, but Stripe will reject):: + + >>> import base64, os + >>> print((b"sk_test_" + base64.b64encode(os.urandom(25)).strip(b"=")).decode("ascii")) + sk_test_Dr+XLVjkC0oO3Zw8Ws0yWtDLqR1sM+/fmw + +Public keys are the same but with "pk_test" instead of "sk_test" ("test" marks a "test mode" key that can only process pretend transactions; for real transactions there are keys with "live" embedded). 
+ + +ZKAP-Issuer Ristretto +````````````````````` Here is a Ristretto key you can use, randomly generated just now:: @@ -19,16 +40,9 @@ For example:: echo -n "SILOWzbnkBjxC1hGde9d5Q3Ir/4yLosCLEnEQGAxEQE=" > ristretto.signing-key -For the Stripe key any random bytes with a little light formatting "work" - at least to make our software happy - but if you want to be able to interact with Stripe and have payments (even pretend payments) move all the way through the system you should get a Stripe account and generate a key w/ them. -Lauri can get you added to our "dev" Stripe account, too, though I forget how important that is for ad hoc dev/testing. - -I think this will work for generating random Stripe secret keys (that our software will load, I think, but Stripe will reject):: - - >>> import base64, os - >>> print((b"sk_test_" + base64.b64encode(os.urandom(25)).strip(b"=")).decode("ascii")) - sk_test_Dr+XLVjkC0oO3Zw8Ws0yWtDLqR1sM+/fmw -Public keys are the same but "pk_test" instead of "sk_test" ("test" is for "test mode" key that can only process pretend txns; for real txns there are keys with "live" embedded). +ZKAP-Issuer TLS +``````````````` The ZKAPIssuer.service needs a working TLS certificate and expects it in the certbot directory for the domain you configured, in my case:: @@ -37,14 +51,26 @@ The ZKAPIssuer.service needs a working TLS certificate and expects it in the cer Move the three .pem files into the payment's server ``/var/lib/letsencrypt/live/payments.localdev/`` directory and issue a ``sudo systemctl restart zkapissuer.service``. -Create Wireguard VPN key pairs in ``PrivateStorageSecrets/monitoringvpn/`` or where you have them:: - for i in "172.23.23.11" "172.23.23.12" "172.23.23.13" "server"; do - wg genkey | tee ${i}.key | wg pubkey > ${i}.pub +Monitoring VPN +`````````````` + +Create WireGuard VPN key pairs in ``secrets/monitoringvpn/`` (or wherever you keep them). 
+ +``tools/create-vpn-keys.sh`` rotates all VPN keys at once:: + + ./tools/create-vpn-keys.sh morph/grid/testing/grid.nix + +Or do it manually:: + + cd secrets/monitoringvpn + for i in 1 11 12 13 ; do + wg genkey | tee 172.23.23.${i}.key | wg pubkey > 172.23.23.${i}.pub done + ln -s 172.23.23.1.key server.key + ln -s 172.23.23.1.pub server.pub + And a shared VPN key for "post-quantum resistance":: wg genpsk > preshared.key - - diff --git a/morph/grid/local/README.rst b/morph/grid/local/README.rst index 73bfbbdd2a11922fe696161fe8346d7e10157313..345547244635734278aa76cb5cd59946f2afd37f 100644 --- a/morph/grid/local/README.rst +++ b/morph/grid/local/README.rst @@ -5,10 +5,18 @@ Set up and use a network of local development VMs (The author of this documentation wasted a lot of time trying to get Vagrant to work with KVM/libvirt. Issues with networking that looked like guest misconfigurations vanished after changing to the better-tested combination of Vagrant and VirtualBox.) +This requires `NixOS <https://nixos.org/>`_. +Nix without the OS will not work. Use the local development environment ````````````````````````````````````` +0. Add VirtualBox to your NixOS system configuration at ``/etc/nixos/configuration.nix``:: + + virtualisation.virtualbox.host.enable = true; + # Save bytes and build time, optional but recommended: + virtualisation.virtualbox.host.headless = true; + 1. Enter the morph local grid directory:: cd morph/grid/local @@ -27,7 +35,7 @@ Use the local development environment 5. Edit the generated configuration: Add the ``publicIP`` addresses from ``grid.nix`` to ssh config **Host** match blocks (**not** HostName) so the ``Host`` lines all read like:: - Host payments1 192.168.67.21 + Host payments 192.168.67.21 HostName 127.0.0.1 User vagrant [...] 
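The Stripe placeholder recipe documented in ``generating-keys.rst`` could be wrapped in a small helper when setting up a fresh ``secrets`` directory. What follows is a minimal sketch, not part of this patch: ``placeholder_stripe_key`` and ``write_key`` are hypothetical names, and writing the key without a trailing newline is an assumption, since the docs do not say whether the loader tolerates one. As with the documented one-liner, Stripe itself will reject the result.

```python
import base64
import os
from pathlib import Path


def placeholder_stripe_key(kind: str = "sk", mode: str = "test", nbytes: int = 25) -> str:
    """Return a random string shaped like a Stripe key, e.g. ``sk_test_...``.

    Same recipe as the documented one-liner: random bytes, base64-encoded,
    with the ``=`` padding stripped. This is NOT a key Stripe will accept.
    """
    if kind not in ("sk", "pk"):
        raise ValueError("kind must be 'sk' (secret) or 'pk' (public)")
    body = base64.b64encode(os.urandom(nbytes)).strip(b"=").decode("ascii")
    return f"{kind}_{mode}_{body}"


def write_key(path: Path, key: str) -> None:
    """Write the key with no trailing newline (assumption) and owner-only access."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(key)
    path.chmod(0o600)
```

A call such as ``write_key(Path("secrets/stripe.secret"), placeholder_stripe_key())`` would fill the path that ``stripeSecretKeyPath`` in the grid ``config.json`` files points to; ``placeholder_stripe_key("pk")`` gives the matching ``pk_test_`` form.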
diff --git a/morph/grid/local/Vagrantfile b/morph/grid/local/Vagrantfile index 82bbef1063b108829261670fdceb2e27af8d6764..7ad95ca872a72e5da6c11b3269e2a824cf8a55f9 100644 --- a/morph/grid/local/Vagrantfile +++ b/morph/grid/local/Vagrantfile @@ -8,8 +8,8 @@ Vagrant.configure("2") do |config| # For a complete reference, please see the online documentation at # https://docs.vagrantup.com. - config.vm.define "payments1" do |config| - config.vm.hostname = "payments1" + config.vm.define "payments" do |config| + config.vm.hostname = "payments" config.vm.box = "esselius/nixos" config.vm.box_version = "20.09" config.vm.box_check_update = false @@ -36,8 +36,8 @@ Vagrant.configure("2") do |config| config.vm.network "private_network", ip: "192.168.67.23" end - config.vm.define "monitoring1" do |config| - config.vm.hostname = "monitoring1" + config.vm.define "monitoring" do |config| + config.vm.hostname = "monitoring" config.vm.box = "esselius/nixos" config.vm.box_version = "20.09" config.vm.box_check_update = false diff --git a/morph/grid/local/grid.nix b/morph/grid/local/grid.nix index a762186d3aaad642fe24aaf7666853fde79986b3..fdc0cde55be4f1b644c212ce20f6c3e44af8e3df 100644 --- a/morph/grid/local/grid.nix +++ b/morph/grid/local/grid.nix @@ -7,45 +7,58 @@ import ../../lib/make-grid.nix { nodes = cfg: let sshUsers = import ./secrets/users.nix; - vpnClientIPs = [ "172.23.23.11" "172.23.23.12" "172.23.23.13" ]; # TBD: derive automatically + # Get absolute vpn key directory path, as a string: monitoringvpnKeyDir = toString ./. 
+ "/${cfg.monitoringvpnKeyDir}"; + + # TBD: derive these automatically: + hostsMap = { + "172.23.23.1" = [ "monitoring" "monitoring.monitoringvpn" ]; + "172.23.23.11" = [ "payments" "payments.monitoringvpn" ]; + "172.23.23.12" = [ "storage1" "storage1.monitoringvpn" ]; + "172.23.23.13" = [ "storage2" "storage2.monitoringvpn" ]; + }; + vpnClientIPs = [ "172.23.23.11" "172.23.23.12" "172.23.23.13" ]; + nodeExporterTargets = [ "monitoring" "payments" "storage1" "storage2" ]; + in { - "payments1" = import ../../lib/make-issuer.nix (cfg // rec { + "payments" = import ../../lib/make-issuer.nix (cfg // rec { publicIPv4 = "192.168.67.21"; monitoringvpnIPv4 = "172.23.23.11"; - inherit monitoringvpnKeyDir; - inherit sshUsers; hardware = import ./virtual-hardware.nix ({ inherit publicIPv4; }); stateVersion = "19.03"; + inherit monitoringvpnKeyDir; + inherit sshUsers; }); "storage1" = import ../../lib/make-testing.nix (cfg // rec { publicIPv4 = "192.168.67.22"; monitoringvpnIPv4 = "172.23.23.12"; - inherit monitoringvpnKeyDir; - inherit sshUsers; hardware = import ./virtual-hardware.nix ({ inherit publicIPv4; }); stateVersion = "19.09"; + inherit monitoringvpnKeyDir; + inherit sshUsers; }); "storage2" = import ../../lib/make-testing.nix (cfg // rec { publicIPv4 = "192.168.67.23"; monitoringvpnIPv4 = "172.23.23.13"; - inherit monitoringvpnKeyDir; - inherit sshUsers; hardware = import ./virtual-hardware.nix ({ inherit publicIPv4; }); stateVersion = "19.09"; + inherit monitoringvpnKeyDir; + inherit sshUsers; }); - "monitoring1" = import ../../lib/make-monitoring.nix (cfg // rec { + "monitoring" = import ../../lib/make-monitoring.nix (cfg // rec { publicIPv4 = "192.168.67.24"; monitoringvpnIPv4 = "172.23.23.1"; inherit vpnClientIPs; - inherit sshUsers; - inherit monitoringvpnKeyDir; + inherit hostsMap; + inherit nodeExporterTargets; hardware = import ./virtual-hardware.nix ({ inherit publicIPv4; }); stateVersion = "19.09"; + inherit monitoringvpnKeyDir; + inherit sshUsers; }); }; 
} diff --git a/morph/grid/production/config.json b/morph/grid/production/config.json index e71cb8b4b5f999e3059f0669c2bc3f92f29242a6..ef7dc53649febcd7beb7901bb3608204df197059 100644 --- a/morph/grid/production/config.json +++ b/morph/grid/production/config.json @@ -1,6 +1,8 @@ { "publicStoragePort": 8898 , "ristrettoSigningKeyPath": "./secrets/ristretto.signing-key" , "stripeSecretKeyPath": "./secrets/stripe.secret" +, "monitoringvpnKeyDir": "./secrets/monitoringvpn" +, "monitoringvpnEndpoint": "monitoring.private.storage:51820" , "passValue": 1000000 , "issuerDomains": [ "payments.privatestorage.io" diff --git a/morph/grid/production/grid.nix b/morph/grid/production/grid.nix index f5735d259dbff27f1d9cabbbca512af81d4550bb..fee0c9be6faed47d4a702b5b53c2419cbb677ba6 100644 --- a/morph/grid/production/grid.nix +++ b/morph/grid/production/grid.nix @@ -7,6 +7,38 @@ import ../../lib/make-grid.nix { nodes = cfg: let sshUsers = import ./secrets/users.nix; + + # Get absolute vpn key directory path, as a string: + monitoringvpnKeyDir = toString ./. + "/${cfg.monitoringvpnKeyDir}"; + + # TBD: derive these automatically: + hostsMap = { + "172.23.23.1" = [ "monitoring" "monitoring.monitoringvpn" ]; + "172.23.23.11" = [ "payments" "payments.monitoringvpn" ]; + "172.23.23.21" = [ "storage001" "storage001.monitoringvpn" ]; + "172.23.23.22" = [ "storage002" "storage002.monitoringvpn" ]; + "172.23.23.23" = [ "storage003" "storage003.monitoringvpn" ]; + "172.23.23.24" = [ "storage004" "storage004.monitoringvpn" ]; + "172.23.23.25" = [ "storage005" "storage005.monitoringvpn" ]; + }; + vpnClientIPs = [ + "172.23.23.11" + "172.23.23.21" + "172.23.23.22" + "172.23.23.23" + "172.23.23.24" + "172.23.23.25" + ]; + nodeExporterTargets = [ + "monitoring" + "payments" + "storage001" + "storage002" + "storage003" + "storage004" + "storage005" + ]; + in { # Here are the hosts that are in this morph network. This is sort of like # a server manifest. 
We try to keep as many of the specific details as @@ -20,42 +52,66 @@ import ../../lib/make-grid.nix { # doesn't specify one. # # The names must be unique! - "payments.privatestorage.io" = import ../../lib/make-issuer.nix ({ + "payments.privatestorage.io" = import ../../lib/make-issuer.nix (cfg // { publicIPv4 = "18.184.142.208"; + monitoringvpnIPv4 = "172.23.23.11"; + inherit monitoringvpnKeyDir; inherit sshUsers; hardware = ../../lib/issuer-aws.nix; stateVersion = "19.03"; - } // cfg); + }); - "storage001" = import ../../lib/make-storage.nix ({ + "storage001" = import ../../lib/make-storage.nix (cfg // { cfg = import ./storage001-config.nix; inherit sshUsers; hardware = ./storage001-hardware.nix; stateVersion = "19.09"; - } // cfg); - "storage002" = import ../../lib/make-storage.nix ({ + monitoringvpnIPv4 = "172.23.23.21"; + inherit monitoringvpnKeyDir; + }); + "storage002" = import ../../lib/make-storage.nix (cfg // { cfg = import ./storage002-config.nix; inherit sshUsers; hardware = ./storage002-hardware.nix; stateVersion = "19.09"; - } // cfg); - "storage003" = import ../../lib/make-storage.nix ({ + monitoringvpnIPv4 = "172.23.23.22"; + inherit monitoringvpnKeyDir; + }); + "storage003" = import ../../lib/make-storage.nix (cfg // { cfg = import ./storage003-config.nix; inherit sshUsers; hardware = ./storage003-hardware.nix; stateVersion = "19.09"; - } // cfg); - "storage004" = import ../../lib/make-storage.nix ({ + monitoringvpnIPv4 = "172.23.23.23"; + inherit monitoringvpnKeyDir; + }); + "storage004" = import ../../lib/make-storage.nix (cfg // { cfg = import ./storage004-config.nix; inherit sshUsers; hardware = ./storage004-hardware.nix; stateVersion = "19.09"; - } // cfg); - "storage005" = import ../../lib/make-storage.nix ({ + monitoringvpnIPv4 = "172.23.23.24"; + inherit monitoringvpnKeyDir; + }); + "storage005" = import ../../lib/make-storage.nix (cfg // { cfg = import ./storage005-config.nix; inherit sshUsers; hardware = ./storage005-hardware.nix; 
stateVersion = "19.03"; - } // cfg); + monitoringvpnIPv4 = "172.23.23.25"; + inherit monitoringvpnKeyDir; + }); + + "monitoring" = import ../../lib/make-monitoring.nix (cfg // { + publicIPv4 = "monitoring.private.storage"; + monitoringvpnIPv4 = "172.23.23.1"; + inherit monitoringvpnKeyDir; + inherit vpnClientIPs; + inherit hostsMap; + inherit nodeExporterTargets; + hardware = ../../lib/issuer-aws.nix; + stateVersion = "19.09"; + inherit sshUsers; + }); }; } diff --git a/morph/grid/production/storage000-config.nix b/morph/grid/production/storage000-config.nix deleted file mode 100644 index 2a056a5489245e7e334a26fd9b784097512f6196..0000000000000000000000000000000000000000 --- a/morph/grid/production/storage000-config.nix +++ /dev/null @@ -1,7 +0,0 @@ -{ "interface" = "eno1"; - "publicIPv4" = "69.36.183.24"; - "prefixLength" = 24; - "gateway" = "69.36.183.1"; - "gatewayInterface" = "eno1"; - "grubDeviceID" = "wwn-0x5000c500936410b9"; -} diff --git a/morph/grid/production/storage000-hardware.nix b/morph/grid/production/storage000-hardware.nix deleted file mode 100644 index f0d8c290ddb50162bdb0fee7e0f0ca67cd3a4f5c..0000000000000000000000000000000000000000 --- a/morph/grid/production/storage000-hardware.nix +++ /dev/null @@ -1,37 +0,0 @@ -# Do not modify this file! It was generated by ‘nixos-generate-config’ -# and may be overwritten by future invocations. Please make changes -# to /etc/nixos/configuration.nix instead. -{ config, lib, pkgs, ... 
}: - -{ - imports = - [ <nixpkgs/nixos/modules/installer/scan/not-detected.nix> - ]; - - boot.initrd.availableKernelModules = [ "ahci" "xhci_pci" "ehci_pci" "megaraid_sas" "usbhid" "sd_mod" ]; - boot.initrd.kernelModules = [ ]; - boot.kernelModules = [ "kvm-intel" ]; - boot.extraModulePackages = [ ]; - - fileSystems."/" = - { device = "/dev/disk/by-uuid/ccabaa39-d888-467e-b8d9-75b5790a91aa"; - fsType = "ext4"; - }; - - fileSystems."/boot" = - { device = "/dev/disk/by-uuid/849c8696-a7e6-42d2-810d-15326d9f9ff6"; - fsType = "ext4"; - }; - - fileSystems."/storage" = - { device = "/dev/disk/by-uuid/2745cbf3-5a63-491d-ab92-6dfd4da1b504"; - fsType = "ext4"; - }; - - swapDevices = - [ { device = "/dev/disk/by-uuid/c6f09c9a-572a-4b0f-b792-412cb5c749d4"; } - ]; - - nix.maxJobs = lib.mkDefault 32; - powerManagement.cpuFreqGovernor = lib.mkDefault "powersave"; -} diff --git a/morph/grid/production/storage003-config.nix b/morph/grid/production/storage003-config.nix index e83546adbcdab2fd35d990a13550dd3907d7226b..5b3f5adf969317322b2c39014e6500294b5f3c02 100644 --- a/morph/grid/production/storage003-config.nix +++ b/morph/grid/production/storage003-config.nix @@ -4,5 +4,5 @@ "prefixLength" = 30; "gateway" = "45.83.89.185"; "gatewayInterface" = "eno1"; - "grubDeviceID" = "wwn-0x5000cca248c31469"; + "grubDeviceID" = "wwn-0x5000039a8bc00766"; } diff --git a/morph/grid/production/storage003-hardware.nix b/morph/grid/production/storage003-hardware.nix index 607943b19117106b532f7c2c2032aea31fce04e3..9882f5372cecd52794e1500bdef30e367008496e 100644 --- a/morph/grid/production/storage003-hardware.nix +++ b/morph/grid/production/storage003-hardware.nix @@ -1,30 +1,31 @@ # Do not modify this file! It was generated by ‘nixos-generate-config’ # and may be overwritten by future invocations. Please make changes # to /etc/nixos/configuration.nix instead. -{ config, lib, pkgs, ... }: +{ config, lib, pkgs, modulesPath, ... 
}: { imports = - [ <nixpkgs/nixos/modules/installer/scan/not-detected.nix> + [ (modulesPath + "/installer/scan/not-detected.nix") ]; - boot.initrd.availableKernelModules = [ "ahci" "xhci_pci" "ehci_pci" "megaraid_sas" "usbhid" "sd_mod" ]; + boot.initrd.availableKernelModules = [ "ahci" "xhci_pci" "ehci_pci" "megaraid_sas" "usbhid" ]; boot.initrd.kernelModules = [ ]; boot.kernelModules = [ "kvm-intel" ]; boot.extraModulePackages = [ ]; + boot.supportedFilesystems = [ "zfs" ]; fileSystems."/" = - { device = "/dev/disk/by-uuid/daf0b345-97da-46bc-b9df-500d771ec375"; + { device = "/dev/disk/by-uuid/240fc1f6-cd55-48a3-ac80-5b3550a32ef5"; fsType = "ext4"; }; fileSystems."/boot" = - { device = "/dev/disk/by-uuid/a1843705-f4e9-4805-924c-19f464d23da7"; + { device = "/dev/disk/by-label/boot"; fsType = "ext4"; }; # Manually created using: - # zpool create -f -m legacy -o ashift=12 root raidz /dev/disk/by-id/{wwn-0x5000cca249d43969,wwn-0x5000cca248dd1f83,wwn-0x5000cca249d44a67,wwn-0x5000cca249d46730,wwn-0x5000cca25dcc719c,wwn-0x5000cca25dcc0241,wwn-0x5000cca24ac2b2df} + # zpool create -f -m legacy -o ashift=12 root raidz /dev/disk/by-id/{wwn-0x5000cca249d43969,wwn-0x5000039a8bc0075e,wwn-0x5000cca249d44a67,wwn-0x5000cca249d46730,wwn-0x5000cca25dcc719c,wwn-0x5000cca25dcc0241,wwn-0x5000039a8bc00765} fileSystems."/storage" = { device = "root"; fsType = "zfs"; diff --git a/morph/grid/testing/config.json b/morph/grid/testing/config.json index ec28840a2857c621a22658efc14368e4c07aa5db..a44b465f7f293f9d70c369a076c30b6cf810924f 100644 --- a/morph/grid/testing/config.json +++ b/morph/grid/testing/config.json @@ -1,6 +1,8 @@ { "publicStoragePort": 8898 , "ristrettoSigningKeyPath": "./secrets/ristretto.signing-key" , "stripeSecretKeyPath": "./secrets/stripe.secret" +, "monitoringvpnKeyDir": "./secrets/monitoringvpn" +, "monitoringvpnEndpoint": "monitoring.privatestorage-staging.com:51820" , "passValue": 1000000 , "issuerDomains": [ "payments.privatestorage-staging.com" diff --git 
a/morph/grid/testing/grid.nix b/morph/grid/testing/grid.nix index 065cd5faa5a5e90a657d1fd1a38e79266e6b6475..e31a28f2eb7817f393f4e8b6b71972b7fd2f79f1 100644 --- a/morph/grid/testing/grid.nix +++ b/morph/grid/testing/grid.nix @@ -7,19 +7,48 @@ import ../../lib/make-grid.nix { nodes = cfg: let sshUsers = import ./secrets/users.nix; + + # Get absolute vpn key directory path, as a string: + monitoringvpnKeyDir = toString ./. + "/${cfg.monitoringvpnKeyDir}"; + + # TBD: derive these automatically: + hostsMap = { + "172.23.23.1" = [ "monitoring" "monitoring.monitoringvpn" ]; + "172.23.23.11" = [ "payments" "payments.monitoringvpn" ]; + "172.23.23.12" = [ "storage001" "storage001.monitoringvpn" ]; + }; + vpnClientIPs = [ "172.23.23.11" "172.23.23.12" ]; + nodeExporterTargets = [ "monitoring" "payments" "storage001" ]; + in { - "payments" = import ../../lib/make-issuer.nix ({ + "payments" = import ../../lib/make-issuer.nix (cfg // { publicIPv4 = "18.194.183.13"; + monitoringvpnIPv4 = "172.23.23.11"; + inherit monitoringvpnKeyDir; inherit sshUsers; hardware = ../../lib/issuer-aws.nix; stateVersion = "19.03"; - } // cfg); + }); "storage001" = import ../../lib/make-testing.nix (cfg // { publicIPv4 = "3.120.26.190"; + monitoringvpnIPv4 = "172.23.23.12"; + inherit monitoringvpnKeyDir; inherit sshUsers; hardware = ./testing001-hardware.nix; stateVersion = "19.03"; }); + + "monitoring" = import ../../lib/make-monitoring.nix (cfg // { + publicIPv4 = "18.156.171.217"; + monitoringvpnIPv4 = "172.23.23.1"; + inherit monitoringvpnKeyDir; + inherit vpnClientIPs; + inherit hostsMap; + inherit nodeExporterTargets; + hardware = ../../lib/issuer-aws.nix; + stateVersion = "19.09"; + inherit sshUsers; + }); }; } diff --git a/morph/lib/make-issuer.nix b/morph/lib/make-issuer.nix index 58b8a4f20496472409c2063a2923bc29f161d68a..bbdf0cebbf770738e9ccb997daec75e58df021b5 100644 --- a/morph/lib/make-issuer.nix +++ b/morph/lib/make-issuer.nix @@ -64,6 +64,7 @@ in rec { hardware 
../../nixos/modules/issuer.nix ../../nixos/modules/monitoring/vpn/client.nix + ../../nixos/modules/monitoring/exporters/node.nix ]; services.private-storage.sshUsers = sshUsers; diff --git a/morph/lib/make-monitoring.nix b/morph/lib/make-monitoring.nix index c37ea2297088fafba1b97e8d037c378505c3d84c..592a859657e624e8fdf5632f8144c5acc6919e8c 100644 --- a/morph/lib/make-monitoring.nix +++ b/morph/lib/make-monitoring.nix @@ -8,6 +8,9 @@ , monitoringvpnIPv4 ? null , monitoringvpnKeyDir ? null , vpnClientIPs ? null +, nodeExporterTargets ? [] +, nginxExporterTargets ? [] +, hostsMap ? {} , ... }: let enableVpn = monitoringvpnKeyDir != null && @@ -32,6 +35,7 @@ action = ["sudo" "systemctl" "restart" "wireguard-monitoringvpn.service"]; }; }; + in rec { deployment = { @@ -42,6 +46,11 @@ in rec { imports = [ hardware ../../nixos/modules/monitoring/vpn/server.nix + ../../nixos/modules/monitoring/server/grafana.nix + ../../nixos/modules/monitoring/server/prometheus.nix + ../../nixos/modules/monitoring/exporters/node.nix + # Loki 0.3.0 from Nixpkgs 19.09 is too old and does not work: + # ../../nixos/modules/monitoring/server/loki.nix ]; services.private-storage.monitoring.vpn.server = if !enableVpn then {} else { @@ -51,5 +60,18 @@ in rec { pubKeysPath = monitoringvpnKeyDir; }; + services.private-storage.monitoring.grafana = { + domain = "monitoring.private.storage"; + prometheusUrl = "http://localhost:9090/"; + lokiUrl = "http://localhost:3100/"; + }; + + services.private-storage.monitoring.prometheus = { + inherit nodeExporterTargets; + inherit nginxExporterTargets; + }; + system.stateVersion = stateVersion; + + networking.hosts = hostsMap; } diff --git a/morph/lib/make-storage.nix b/morph/lib/make-storage.nix index af0867c8b8342e31393f19a76a7cbfc4c95f86c9..6619336d758f69a677e9178592357480aed3f0c8 100644 --- a/morph/lib/make-storage.nix +++ b/morph/lib/make-storage.nix @@ -11,8 +11,36 @@ # to avoid breaking some software such as # database servers. 
You should change this only # after NixOS release notes say you should. +, monitoringvpnKeyDir ? null # The directory that holds the VPN keys. +, monitoringvpnIPv4 ? null # This node's IP in the monitoring VPN. +, monitoringvpnEndpoint ? null # The VPN server and port. , ... -}: rec { +}: let + + enableVpn = monitoringvpnKeyDir != null && + monitoringvpnIPv4 != null && + monitoringvpnEndpoint != null; + + vpnSecrets = if !enableVpn then {} else { + "monitoringvpn-secret-key" = { + source = monitoringvpnKeyDir + "/${monitoringvpnIPv4}.key"; + destination = "/run/keys/monitoringvpn/client.key"; + owner.user = "root"; + owner.group = "root"; + permissions = "0400"; + action = ["sudo" "systemctl" "restart" "wireguard-monitoringvpn.service"]; + }; + "monitoringvpn-preshared-key" = { + source = monitoringvpnKeyDir + "/preshared.key"; + destination = "/run/keys/monitoringvpn/preshared.key"; + owner.user = "root"; + owner.group = "root"; + permissions = "0400"; + action = ["sudo" "systemctl" "restart" "wireguard-monitoringvpn.service"]; + }; + }; + +in rec { deployment = { targetHost = cfg.publicIPv4; @@ -28,7 +56,7 @@ # extract it from the tahoe-lafs nixos module somehow? action = ["sudo" "systemctl" "restart" "tahoe.storage.service"]; }; - }; + } // vpnSecrets; }; # Any extra NixOS modules to load on this server. @@ -40,6 +68,10 @@ # Bring in our module for configuring the Tahoe-LAFS service and other # Private Storage-specific things. ../../nixos/modules/private-storage.nix + # Connect to the monitoringvpn. + ../../nixos/modules/monitoring/vpn/client.nix + # Expose base system metrics over the monitoringvpn. 
+ ../../nixos/modules/monitoring/exporters/node.nix ]; # Pass the configuration specific to this host to the 100TB module to be @@ -67,4 +99,11 @@ }; system.stateVersion = stateVersion; + + services.private-storage.monitoring.vpn.client = if !enableVpn then {} else { + enable = true; + ip = monitoringvpnIPv4; + endpoint = monitoringvpnEndpoint; + endpointPublicKeyFile = monitoringvpnKeyDir + "/server.pub"; + }; } diff --git a/morph/lib/make-testing.nix b/morph/lib/make-testing.nix index f1c1b56fc5444322a8f3a1191fe296fe23528a3e..3f6e767db5ee734a8ca2314b216d4fa602c01907 100644 --- a/morph/lib/make-testing.nix +++ b/morph/lib/make-testing.nix @@ -57,6 +57,7 @@ in rec { hardware ../../nixos/modules/private-storage.nix ../../nixos/modules/monitoring/vpn/client.nix + ../../nixos/modules/monitoring/exporters/node.nix ]; services.private-storage = diff --git a/nixos/modules/monitoring/exporters/node.nix b/nixos/modules/monitoring/exporters/node.nix new file mode 100644 index 0000000000000000000000000000000000000000..62702e82f1e0a6bd9effae871f275c5dd23a37ae --- /dev/null +++ b/nixos/modules/monitoring/exporters/node.nix @@ -0,0 +1,74 @@ +# Prometheus common node exporter config +# +# Scope: Export platform data like CPU, memory, disk space etc. to be +# polled by Prometheus server +# Usage: Import this to every server you want to include in the central +# monitoring system +# See https://nixos.org/manual/nixos/stable/#module-services-prometheus-exporters + +{ config, lib, pkgs, ... 
}: + +with lib; + +let + mountsFileSystemType = fsType: {} != filterAttrs (n: v: v.fsType == fsType) config.fileSystems; + +in { + config.services.prometheus.exporters.node = { + enable = true; + openFirewall = true; + firewallFilter = "-i monitoringvpn -p tcp -m tcp --dport 9100"; + port = 9100; + # extraFlags = [ "--collector.disable-defaults" ]; # not in nixpkgs 19.09 + # Thanks https://github.com/mayflower/nixexprs/blob/master/modules/monitoring/default.nix + enabledCollectors = [ + "arp" + "bcache" + "conntrack" + "filefd" + "logind" + "netclass" + "netdev" + "netstat" + #"rapl" # not in nixpkgs 19.09 + "sockstat" + #"softnet" # not in nixpkgs 19.09 + "stat" + "systemd" + # "textfile" + # "textfile.directory /run/prometheus-node-exporter" + #"thermal_zone" # not in nixpkgs 19.09 + "time" + #"udp_queues" # not in nixpkgs 19.09 + "uname" + "vmstat" + ] ++ optionals (!config.boot.isContainer) [ + "cpu" + "cpufreq" + "diskstats" + "edac" + "entropy" + "filesystem" + "hwmon" + "interrupts" + "ksmd" + "loadavg" + "meminfo" + "pressure" + "timex" + ] ++ ( + optionals (config.services.nfs.server.enable) [ "nfsd" ] + ) ++ ( + optionals ("" != config.boot.initrd.mdadmConf) [ "mdadm" ] + ) ++ ( + optionals ({} != config.networking.bonds) [ "bonding" ] + ) ++ ( + optionals (mountsFileSystemType "nfs") [ "nfs" ] + ) ++ ( + optionals (mountsFileSystemType "xfs") [ "xfs" ] + ) ++ ( + optionals (mountsFileSystemType "zfs" || elem "zfs" config.boot.supportedFilesystems) [ "zfs" ] + ); + }; +} + diff --git a/nixos/modules/monitoring/server/grafana-config/resources-overview.json b/nixos/modules/monitoring/server/grafana-config/resources-overview.json new file mode 100644 index 0000000000000000000000000000000000000000..cd171d50594d77153f4d905bd91aec12f6bafcb9 --- /dev/null +++ b/nixos/modules/monitoring/server/grafana-config/resources-overview.json @@ -0,0 +1,1286 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": "-- Grafana --", + "enable": true, + "hide": 
true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "description": "USE: Usage, Saturation and Error rate for our resources", + "editable": true, + "gnetId": null, + "graphTooltip": 0, + "links": [], + "panels": [ + { + "collapsed": false, + "datasource": null, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 22, + "panels": [], + "title": "CPU & Memory", + "type": "row" + }, + { + "aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": null, + "description": "Some of our software runs in a single thread, so this shows max CPU per core (instead of averaged over all cores)", + "fieldConfig": { + "defaults": { + "custom": {} + }, + "overrides": [] + }, + "fill": 0, + "fillGradient": 0, + "gridPos": { + "h": 7, + "w": 8, + "x": 0, + "y": 1 + }, + "hiddenSeries": false, + "id": 28, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": false, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true, + "dataLinks": [] + }, + "percentage": false, + "pluginVersion": "7.3.5", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "expr": "1 - (max by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])))", + "interval": "", + "intervalFactor": 1, + "legendFormat": "{{instance}}", + "refId": "A" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Max CPU % per core per node", + "tooltip": { + "shared": true, + "sort": 2, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "format": "percentunit", + "label": null, + "logBase": 1, + "max": "1", + 
"min": "0", + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "alert": { + "alertRuleTags": {}, + "conditions": [ + { + "evaluator": { + "params": [ + 1 + ], + "type": "gt" + }, + "operator": { + "type": "and" + }, + "query": { + "params": [ + "A", + "5m", + "now" + ] + }, + "reducer": { + "params": [], + "type": "avg" + }, + "type": "query" + } + ], + "executionErrorState": "alerting", + "for": "5m", + "frequency": "1m", + "handler": 1, + "name": "15 min load average alert", + "noDataState": "no_data", + "notifications": [] + }, + "aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": null, + "fieldConfig": { + "defaults": { + "custom": {}, + "displayName": "${__field.labels.instance}" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "node_load15{instance=\"grafana:9100\", job=\"node-exporters\"}" + }, + "properties": [ + { + "id": "links" + } + ] + } + ] + }, + "fill": 0, + "fillGradient": 0, + "gridPos": { + "h": 7, + "w": 8, + "x": 8, + "y": 1 + }, + "hiddenSeries": false, + "id": 6, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": false, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true, + "dataLinks": [] + }, + "percentage": false, + "pluginVersion": "7.3.5", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "expr": "node_load15", + "interval": "", + "intervalFactor": 1, + "legendFormat": "{{instance}}", + "refId": "A" + } + ], + "thresholds": [ + { + "colorMode": "critical", + "fill": true, + "line": true, + "op": "gt", + "value": 1, + "yaxis": "left" + } + ], + "timeFrom": null, + "timeRegions": [], + 
"timeShift": null, + "title": "15 min load average", + "tooltip": { + "shared": true, + "sort": 2, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "format": "short", + "label": null, + "logBase": 1, + "max": "1", + "min": null, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "alert": { + "alertRuleTags": {}, + "conditions": [ + { + "evaluator": { + "params": [ + 0.8 + ], + "type": "gt" + }, + "operator": { + "type": "and" + }, + "query": { + "params": [ + "A", + "15m", + "now" + ] + }, + "reducer": { + "params": [], + "type": "avg" + }, + "type": "query" + } + ], + "executionErrorState": "alerting", + "for": "5m", + "frequency": "1m", + "handler": 1, + "name": "RAM filling up", + "noDataState": "no_data", + "notifications": [] + }, + "aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": null, + "description": "How much RAM is in use? 
Relative to available system memory.", + "fieldConfig": { + "defaults": { + "custom": {} + }, + "overrides": [] + }, + "fill": 0, + "fillGradient": 0, + "gridPos": { + "h": 7, + "w": 8, + "x": 16, + "y": 1 + }, + "hiddenSeries": false, + "id": 2, + "legend": { + "alignAsTable": false, + "avg": false, + "current": false, + "max": false, + "min": false, + "show": false, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true, + "dataLinks": [] + }, + "percentage": false, + "pluginVersion": "7.3.5", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "expr": "1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes\r\n", + "interval": "", + "intervalFactor": 4, + "legendFormat": "{{instance}}", + "refId": "A" + } + ], + "thresholds": [ + { + "colorMode": "critical", + "fill": true, + "line": true, + "op": "gt", + "value": 0.8, + "yaxis": "left" + } + ], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "RAM used %", + "tooltip": { + "shared": true, + "sort": 2, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "decimals": null, + "format": "percentunit", + "label": null, + "logBase": 1, + "max": "1", + "min": "0", + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "collapsed": false, + "datasource": null, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 8 + }, + "id": 20, + "panels": [], + "title": "Network", + "type": "row" + }, + { + "aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + 
"datasource": null, + "description": "Shows most saturated network link for every node. Baseline is the reported NIC link speed - that might not be the actual limit.", + "fieldConfig": { + "defaults": { + "custom": {} + }, + "overrides": [] + }, + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 7, + "w": 8, + "x": 0, + "y": 9 + }, + "hiddenSeries": false, + "id": 12, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": false, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true, + "dataLinks": [] + }, + "percentage": false, + "pluginVersion": "7.3.5", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "expr": "max by (instance) (rate(node_network_transmit_bytes_total{device!~\"lo|monitoringvpn\"}[5m]) / node_network_speed_bytes)", + "interval": "", + "intervalFactor": 4, + "legendFormat": "{{instance}} out", + "refId": "A" + }, + { + "expr": "- max by (instance) (rate(node_network_receive_bytes_total{device!~\"lo|monitoringvpn\"}[5m]) / node_network_speed_bytes)", + "interval": "", + "intervalFactor": 4, + "legendFormat": "{{instance}} in", + "refId": "B" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Throughput %", + "tooltip": { + "shared": false, + "sort": 2, + "value_type": "individual" + }, + "transformations": [], + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "decimals": null, + "format": "percentunit", + "label": null, + "logBase": 1, + "max": "1", + "min": "-1", + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + 
"aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": null, + "description": "Packet and error count. Positive values mean transmit, negative receive.", + "fieldConfig": { + "defaults": { + "custom": {} + }, + "overrides": [] + }, + "fill": 0, + "fillGradient": 0, + "gridPos": { + "h": 7, + "w": 8, + "x": 8, + "y": 9 + }, + "hiddenSeries": false, + "id": 26, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": false, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null as zero", + "options": { + "alertThreshold": true, + "dataLinks": [] + }, + "percentage": false, + "pluginVersion": "7.3.5", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "expr": "- rate(node_network_receive_packets_total{device!~\"lo|monitoringvpn\"}[5m])", + "interval": "", + "intervalFactor": 4, + "legendFormat": "{{instance}} {{device}}", + "refId": "A" + }, + { + "expr": "- rate(node_network_receive_errs_total{device!~\"lo|monitoringvpn\"}[5m])", + "interval": "", + "intervalFactor": 4, + "legendFormat": "{{instance}} {{device}}", + "refId": "B" + }, + { + "expr": "rate(node_network_transmit_packets_total{device!~\"lo|monitoringvpn\"}[5m])", + "interval": "", + "intervalFactor": 4, + "legendFormat": "{{instance}} {{device}}", + "refId": "C" + }, + { + "expr": "rate(node_network_transmit_errs_total{device!~\"lo|monitoringvpn\"}[5m])", + "interval": "", + "intervalFactor": 4, + "legendFormat": "{{instance}} {{device}}", + "refId": "D" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Network pkt/s", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + 
"yaxes": [ + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "alert": { + "alertRuleTags": {}, + "conditions": [ + { + "evaluator": { + "params": [ + 10 + ], + "type": "gt" + }, + "operator": { + "type": "and" + }, + "query": { + "params": [ + "A", + "5m", + "now" + ] + }, + "reducer": { + "params": [], + "type": "avg" + }, + "type": "query" + }, + { + "evaluator": { + "params": [ + 10 + ], + "type": "gt" + }, + "operator": { + "type": "or" + }, + "query": { + "params": [ + "B", + "5m", + "now" + ] + }, + "reducer": { + "params": [], + "type": "avg" + }, + "type": "query" + }, + { + "evaluator": { + "params": [ + 10 + ], + "type": "gt" + }, + "operator": { + "type": "or" + }, + "query": { + "params": [ + "C", + "5m", + "now" + ] + }, + "reducer": { + "params": [], + "type": "avg" + }, + "type": "query" + }, + { + "evaluator": { + "params": [ + 10 + ], + "type": "gt" + }, + "operator": { + "type": "or" + }, + "query": { + "params": [ + "D", + "5m", + "now" + ] + }, + "reducer": { + "params": [], + "type": "avg" + }, + "type": "query" + } + ], + "executionErrorState": "alerting", + "for": "5m", + "frequency": "1m", + "handler": 1, + "name": "Network errors alert", + "noDataState": "no_data", + "notifications": [] + }, + "aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": null, + "description": "Network errors, drops etc. 
Should all be 0.", + "fieldConfig": { + "defaults": { + "custom": {} + }, + "overrides": [] + }, + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 7, + "w": 8, + "x": 16, + "y": 9 + }, + "hiddenSeries": false, + "id": 10, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": false, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true, + "dataLinks": [] + }, + "percentage": false, + "pluginVersion": "7.3.5", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "expr": "node_network_transmit_errs_total\n", + "interval": "", + "legendFormat": "{{instance}} {{device}}", + "refId": "A" + }, + { + "expr": "node_network_transmit_drop_total", + "interval": "", + "legendFormat": "{{instance}} {{device}}", + "refId": "B" + }, + { + "expr": "- node_network_receive_drop_total", + "interval": "", + "legendFormat": "{{instance}} {{device}}", + "refId": "C" + }, + { + "expr": "- node_network_receive_errs_total", + "interval": "", + "legendFormat": "{{instance}} {{device}}", + "refId": "D" + } + ], + "thresholds": [ + { + "colorMode": "critical", + "fill": true, + "line": true, + "op": "gt", + "value": 10 + } + ], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Network errors", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "collapsed": false, + "datasource": 
null, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 16 + }, + "id": 18, + "panels": [], + "title": "Storage", + "type": "row" + }, + { + "alert": { + "alertRuleTags": {}, + "conditions": [ + { + "evaluator": { + "params": [ + 0.8 + ], + "type": "gt" + }, + "operator": { + "type": "and" + }, + "query": { + "params": [ + "A", + "5m", + "now" + ] + }, + "reducer": { + "params": [], + "type": "avg" + }, + "type": "query" + } + ], + "executionErrorState": "alerting", + "for": "5m", + "frequency": "1m", + "handler": 1, + "name": "Filesystem usage % alert", + "noDataState": "no_data", + "notifications": [] + }, + "aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": null, + "description": "Watch filesystems filling up. Shows only mounts over 10 % of available bytes used.", + "fieldConfig": { + "defaults": { + "custom": {}, + "unit": "percentunit" + }, + "overrides": [] + }, + "fill": 0, + "fillGradient": 0, + "gridPos": { + "h": 7, + "w": 8, + "x": 0, + "y": 17 + }, + "hiddenSeries": false, + "id": 4, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": false, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true, + "dataLinks": [] + }, + "percentage": false, + "pluginVersion": "7.3.5", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "expr": "1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.1", + "format": "time_series", + "hide": false, + "instant": false, + "interval": "", + "intervalFactor": 2, + "legendFormat": "{{instance}} {{mountpoint}} ", + "refId": "A" + } + ], + "thresholds": [ + { + "colorMode": "critical", + "fill": true, + "line": true, + "op": "gt", + "value": 0.8, + "yaxis": "left" + } + ], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, 
+ "title": "Storage usage %", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "transformations": [], + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "format": "percentunit", + "label": null, + "logBase": 1, + "max": "1", + "min": "0", + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": null, + "description": "Input Output Operations per second. Positive values mean read, negative write.", + "fieldConfig": { + "defaults": { + "custom": {} + }, + "overrides": [] + }, + "fill": 0, + "fillGradient": 0, + "gridPos": { + "h": 7, + "w": 8, + "x": 8, + "y": 17 + }, + "hiddenSeries": false, + "id": 14, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": false, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null as zero", + "options": { + "alertThreshold": true, + "dataLinks": [] + }, + "percentage": false, + "pluginVersion": "7.3.5", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "expr": "rate(node_disk_reads_completed_total[5m]) > 0", + "interval": "", + "intervalFactor": 4, + "legendFormat": "{{instance}} R {{device}}", + "refId": "A" + }, + { + "expr": "- (rate(node_disk_writes_completed_total[5m]) > 0)", + "interval": "", + "intervalFactor": 4, + "legendFormat": "{{instance}} W {{device}}", + "refId": "B" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "IOPS", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "type": 
"graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": null, + "description": "Max average storage latency per node. Positive values mean read, negative write.", + "fieldConfig": { + "defaults": { + "custom": {} + }, + "overrides": [] + }, + "fill": 0, + "fillGradient": 0, + "gridPos": { + "h": 7, + "w": 8, + "x": 16, + "y": 17 + }, + "hiddenSeries": false, + "id": 16, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": false, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null as zero", + "options": { + "alertThreshold": true, + "dataLinks": [] + }, + "percentage": false, + "pluginVersion": "7.3.5", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "expr": "max by (instance, device) (rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]))", + "interval": "", + "intervalFactor": 4, + "legendFormat": "{{instance}} R {{device}}", + "refId": "A" + }, + { + "expr": "- max by (instance, device) (rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m]))", + "interval": "", + "intervalFactor": 4, + "legendFormat": "{{instance}} W {{device}}", + "refId": "B" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Storage latency", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + 
"type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "format": "s", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + } + ], + "refresh": "30s", + "schemaVersion": 20, + "style": "dark", + "tags": [], + "templating": { + "list": [] + }, + "time": { + "from": "now-1h", + "to": "now" + }, + "timepicker": {}, + "timezone": "utc", + "title": "Resources overview", + "uid": "ResSatUse", + "version": 1 +} diff --git a/nixos/modules/monitoring/server/grafana.nix b/nixos/modules/monitoring/server/grafana.nix new file mode 100644 index 0000000000000000000000000000000000000000..d5724e7188cab5155d7f1976420185388caf5d64 --- /dev/null +++ b/nixos/modules/monitoring/server/grafana.nix @@ -0,0 +1,79 @@ +# Grafana Server +# +# Scope: Beautiful plots of time series data retrieved from Prometheus +# See https://christine.website/blog/prometheus-grafana-loki-nixos-2020-11-20 + +{ config, lib, ... 
}: + +let + cfg = config.services.private-storage.monitoring.grafana; + +in { + options.services.private-storage.monitoring.grafana = { + domain = lib.mkOption + { type = lib.types.str; + example = lib.literalExample "grafana.grid.private.storage"; + description = "The FQDN of the Grafana host"; + }; + prometheusUrl = lib.mkOption + { type = lib.types.str; + example = lib.literalExample "http://prometheus:9090/"; + default = "http://prometheus:9090/"; + description = "The URL of the Prometheus host to access"; + }; + lokiUrl = lib.mkOption + { type = lib.types.str; + example = lib.literalExample "http://loki:3100/"; + default = "http://loki:3100/"; + description = "The URL of the Loki host to access"; + }; + }; + + config = { + # networking.firewall.allowedTCPPorts = [ 80 443 ]; + + services.grafana = { + enable = true; + domain = cfg.domain; + port = 2342; + addr = "127.0.0.1"; + + # All three are required to forego the user/pass prompt: + auth.anonymous.enable = true; + auth.anonymous.org_role = "Admin"; + auth.anonymous.org_name = "Main Org."; + }; + + services.grafana.provision = { + enable = true; + # See https://grafana.com/docs/grafana/latest/administration/provisioning/#datasources + datasources = [{ + name = "Prometheus"; + type = "prometheus"; + access = "proxy"; + url = cfg.prometheusUrl; + isDefault = true; + } { + name = "Loki"; + type = "loki"; + access = "proxy"; + url = cfg.lokiUrl; + }]; + # See https://grafana.com/docs/grafana/latest/administration/provisioning/#dashboards + dashboards = [{ + name = "provisioned"; + options.path = ./grafana-config; + }]; + }; + + # nginx reverse proxy + services.nginx.enable = true; + services.nginx.virtualHosts.${config.services.grafana.domain} = { + locations."/" = { + proxyPass = "http://127.0.0.1:${toString config.services.grafana.port}"; + proxyWebsockets = true; + }; + }; + }; +} + diff --git a/nixos/modules/monitoring/server/loki.nix b/nixos/modules/monitoring/server/loki.nix new file mode 100644 index 
0000000000000000000000000000000000000000..96554523f06d0d86c620db445b2443575a1c3fd3 --- /dev/null +++ b/nixos/modules/monitoring/server/loki.nix @@ -0,0 +1,78 @@ +# Loki Server +# +# Scope: Log aggregator + +{ + config.networking.firewall.allowedTCPPorts = [ 3100 ]; + + config.services.loki = { + enable = true; + + configuration = + { + auth_enabled = false; + + server = { + http_listen_port = 3100; + }; + + ingester = { + lifecycler = { + address = "0.0.0.0"; + ring = { + kvstore = { + store = "inmemory"; + }; + replication_factor = 1; + }; + final_sleep = "0s"; + }; + chunk_idle_period = "1h"; # Any chunk not receiving new logs in this time will be flushed + max_chunk_age = "1h"; # All chunks will be flushed when they hit this age, default is 1h + chunk_target_size = 1048576; # Loki will attempt to build chunks up to 1 MiB (1048576 bytes), flushing earlier if chunk_idle_period or max_chunk_age is reached first + chunk_retain_period = "30s"; # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m) + max_transfer_retries = 0; # Chunk transfers disabled + }; + + schema_config = { + configs = [{ + from = "2020-10-24"; # TODO: Should this be "today"? 
+ store = "boltdb-shipper"; + object_store = "filesystem"; + schema = "v11"; + index = { + prefix = "index_"; + period = "24h"; + }; + }]; + }; + + storage_config = { + boltdb_shipper = { + active_index_directory = "/var/lib/loki/boltdb-shipper-active"; + cache_location = "/var/lib/loki/boltdb-shipper-cache"; + cache_ttl = "24h"; # Can be increased for faster performance over longer query periods, uses more disk space + shared_store = "filesystem"; + }; + filesystem = { + directory = "/var/lib/loki/chunks"; + }; + }; + + limits_config = { + reject_old_samples = true; + reject_old_samples_max_age = "168h"; + }; + + chunk_store_config = { + max_look_back_period = "336h"; + }; + + table_manager = { + retention_deletes_enabled = true; + retention_period = "336h"; + }; + }; + }; +} + diff --git a/nixos/modules/monitoring/server/prometheus.nix b/nixos/modules/monitoring/server/prometheus.nix new file mode 100644 index 0000000000000000000000000000000000000000..36c2ba6402559771dff8771b1369842e21f7ff7f --- /dev/null +++ b/nixos/modules/monitoring/server/prometheus.nix @@ -0,0 +1,56 @@ +# Prometheus server +# +# Scope: Pull data from our cluster machines into TSDB +# See https://christine.website/blog/prometheus-grafana-loki-nixos-2020-11-20 + +{ config, lib, ... 
}: +let + + exportersCfg = config.services.prometheus.exporters; + cfg = config.services.private-storage.monitoring.prometheus; + dropPortNumber = { + source_labels = [ "__address__" ]; + regex = "^(.*):\\d+$"; + target_label = "instance"; + }; + +in { + options.services.private-storage.monitoring.prometheus = { + nodeExporterTargets = lib.mkOption { + type = with lib.types; listOf str; + example = lib.literalExample "[ node1 node2 ]"; + description = "List of nodes (hostnames or IPs) to scrape."; + }; + nginxExporterTargets = lib.mkOption { + type = with lib.types; listOf str; + example = lib.literalExample "[ node1 node2 ]"; + description = "List of nodes (hostnames or IPs) to scrape."; + }; + }; + + config = rec { + # networking.firewall.allowedTCPPorts = [ services.prometheus.port ]; + + services.prometheus = { + enable = true; + # port = 9090; # Option only in recent (20.09?) nixpkgs, 9090 default + scrapeConfigs = [ + { + job_name = "node-exporters"; + static_configs = [{ + targets = map (x: x + ":" + (toString exportersCfg.node.port)) cfg.nodeExporterTargets; + }]; + relabel_configs = [ dropPortNumber ]; + } + { + job_name = "nginx-exporters"; + static_configs = [{ + targets = map (x: x + ":" + (toString exportersCfg.nginx.port)) cfg.nginxExporterTargets; + }]; + relabel_configs = [ dropPortNumber ]; + } + ]; + }; + }; +} + diff --git a/shell.nix b/shell.nix index 6e46c9ca0feaa3ab6fbd22c1228ec786a49e79b6..2c1c5123da656d34fafe0883b50ef49c578c6c8b 100644 --- a/shell.nix +++ b/shell.nix @@ -8,5 +8,6 @@ pkgs.mkShell { buildInputs = [ pkgs.morph stable2105.vagrant + pkgs.jp ]; } diff --git a/tools/create-vpn-keys.sh b/tools/create-vpn-keys.sh new file mode 100755 index 0000000000000000000000000000000000000000..e092a8ced698bd3a3bb2d4acc3ca07a3a8e6032d --- /dev/null +++ b/tools/create-vpn-keys.sh @@ -0,0 +1,40 @@ +#!/usr/bin/env bash + +# Scope: Create wireguard keys for all monitoringVPN hosts +# Parameters: +# file: path to grid.nix of morph deployment +# +# 
Output: Key files for all monitoring VPN hosts in secrets/monitoringvpn +# relative to the grid.nix +# +# The server key will also be symlinked to server.{key,pub}. + +set -euxo pipefail + +umask 077 + +if [[ $# -ne 1 ]]; then + echo "Illegal number of parameters. Expected: file (path of grid.nix)" + exit 2 +fi + +SRC=$(dirname "$0") +VPN_SECRETS=$(dirname "$1")/secrets/monitoringvpn + +CONFIG=$(nix-instantiate --strict --json --eval "${SRC}"/get-vpn-config.nix --arg pathToGrid "${1}") + +MONITORING_IPS=$(echo "$CONFIG" | jp --unquoted "join(' ', clientIPs)") +VPNSERVER_IP=$(echo "$CONFIG" | jp --unquoted "serverIP") + +mkdir -p "${VPN_SECRETS}" + +for i in $MONITORING_IPS $VPNSERVER_IP; do + wg genkey | tee "${VPN_SECRETS}"/${i}.key | wg pubkey > "${VPN_SECRETS}"/${i}.pub +done + +wg genpsk > "${VPN_SECRETS}"/preshared.key + +ln -fs "$VPNSERVER_IP.key" "${VPN_SECRETS}"/server.key +ln -fs "$VPNSERVER_IP.pub" "${VPN_SECRETS}"/server.pub + +# EOF diff --git a/tools/get-vpn-config.nix b/tools/get-vpn-config.nix new file mode 100644 index 0000000000000000000000000000000000000000..7753292aa83c4b63be7457228de0cd84e6eeefa2 --- /dev/null +++ b/tools/get-vpn-config.nix @@ -0,0 +1,19 @@ +# A function that accepts a path to a grid.nix-style file and returns a set +# with two attributes: +# +# * serverIP - a string giving the VPN IP address of the grid's VPN server. +# +# * clientIPs - a list of strings giving the VPN IP addresses of all of the +# grid's VPN clients. +# +{ pathToGrid }: +let + grid = import pathToGrid; + vpnConfig = node: node.services.private-storage.monitoring.vpn or null; + vpnClientIP = node: (vpnConfig node).client.ip or null; + vpnServerIP = node: (vpnConfig node).server.ip or null; +in +{ + "serverIP" = vpnServerIP grid.monitoring; + "clientIPs" = builtins.filter (x: x != null) (map vpnClientIP (builtins.attrValues grid)); +}
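
For reviewers who want to see the resulting file layout without installing `wg`, `nix-instantiate` or `jp`, here is a rough sketch of what `tools/create-vpn-keys.sh` appears to leave behind in `secrets/monitoringvpn`. It is not the script itself: the IP addresses and the `DEMO_ROOT` location are made up for illustration, and placeholder strings stand in for the output of `wg genkey` and `wg pubkey`.

```shell
#!/usr/bin/env bash
# Sketch only: mimics the file layout of create-vpn-keys.sh with placeholder
# contents instead of real WireGuard keys.
set -euo pipefail
umask 077

# Hypothetical values; the real script reads them from grid.nix via
# get-vpn-config.nix and jp.
CLIENT_IPS="172.23.23.11 172.23.23.12"
SERVER_IP="172.23.23.1"

# Hypothetical scratch location standing in for the directory of grid.nix.
DEMO_ROOT="${TMPDIR:-/tmp}/vpn-keys-sketch"
VPN_SECRETS="${DEMO_ROOT}/secrets/monitoringvpn"
mkdir -p "${VPN_SECRETS}"

# One key pair per host, named after its VPN IP address.
for ip in ${CLIENT_IPS} ${SERVER_IP}; do
    echo "private-placeholder-${ip}" > "${VPN_SECRETS}/${ip}.key"
    echo "public-placeholder-${ip}" > "${VPN_SECRETS}/${ip}.pub"
done

# Relative symlinks: they stay valid if the secrets directory is moved as a whole.
ln -fs "${SERVER_IP}.key" "${VPN_SECRETS}/server.key"
ln -fs "${SERVER_IP}.pub" "${VPN_SECRETS}/server.pub"

ls -l "${VPN_SECRETS}"
```

Because the link targets are bare file names rather than absolute paths, `server.key` and `server.pub` resolve inside whatever directory the secrets end up in, which seems to be the reason the script uses `ln -fs $VPNSERVER_IP.key` rather than a full path.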