> RHOSP tripleo can also deploy Ceph
|
|
> To separate the storage deployment from the Openstack deployment and simplify any DR/Recovery/Redeployment, we will create a stand-alone Ceph cluster and integrate it with the Openstack overcloud
|
|
> Opensource Ceph can be installed for further cost saving
|
|
> [https://access.redhat.com/documentation/en-us/red\_hat\_openstack\_platform/16.1/html-single/integrating\_an\_overcloud\_with\_an\_existing\_red\_hat\_ceph\_cluster/index](https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/integrating_an_overcloud_with_an_existing_red_hat_ceph_cluster/index)
|
|
|
|
# Ceph pacific setup
|
|
|
|
## Get access to ceph nodes
|
|
|
|
- Rocky linux
|
|
- 3 Physical Nodes
|
|
- 1G Ethernet for Openstack control plane management network
|
|
- 2 x 25G Ethernet LACP bond for all service networks
|
|
- 4 disk per node, 2 OS RAID1 in BIOS, 2 Ceph 960GB
|
|
- Setup OS disk as LVM boot 1GB, root 240GB, swap 4GB
|
|
- Credentials - root:Password0
|
|
|
|
# Ceph architecture
|
|
|
|
Ceph services:
|
|
- 3 monitors
- 3 managers
- 6 osd
- 3 mds (2 standby) - not being used
- 3 rgw (2 standby - fronted by LBL) - not being used
|
|
|
|
Networks:
|
|
|
|
- 'Ceph public network' (Ceph services) VLAN13, this is the same network as the 'Openstack storage network'.
|
|
- 'Ceph cluster network' (OSD replication+services) VLAN15.
|
|
- 'Openstack storage management network' VLAN14, this network is a prerequisite of the Openstack Tripleo installer; it may not be used with an external Ceph installation, but it is added to cover all bases.
|
|
- 'Openstack control plane network' VLAN1(native), this network will serve as the main ingress to the Ceph cluster nodes.
|
|
- 'Openstack external network' VLAN1214, this network has an externally routable gateway.
|
|
|
|
| Network | VLAN | Interface | IP Range | Gateway | DNS |
|
|
| --- | --- | --- | --- | --- | --- |
|
|
| Ceph public<br>(Openstack storage) | 13 | bond0 | 10.122.10.0/24 | NA | NA |
|
|
| Ceph cluster | 15 | bond0 | 10.122.14.0/24 | NA | NA |
|
|
| Openstack storage management | 14 | bond0 | 10.122.12.0/24 | NA | NA |
|
|
| Openstack control plane | 1(native) | ens4f0 | 10.122.0.0/24 | NA | NA |
|
|
| Openstack external | 1214 | bond0 | 10.121.4.0/24 | 10.121.4.1 | 144.173.6.71<br>1.1.1.1 |
|
|
|
|
IP allocation:
|
|
|
|
> For all ranges, addresses 7-13 in the last octet are reserved for Ceph; beyond the three node addresses the remainder are spare, either for additional nodes or for RGW/LoadBalancer services.
|
|
|
|
| Node | ceph1 | ceph2 | ceph3 |
|
|
| --- | --- | --- | --- |
|
|
| Ceph public<br>(Openstack storage) | 10.122.10.7 | 10.122.10.8 | 10.122.10.9 |
|
|
| Ceph cluster | 10.122.14.7 | 10.122.14.8 | 10.122.14.9 |
|
|
| Openstack storage management | 10.122.12.7 | 10.122.12.8 | 10.122.12.9 |
|
|
| Openstack control plane | 10.122.0.7 | 10.122.0.8 | 10.122.0.9 |
|
|
| Openstack external | 10.121.4.7 | 10.121.4.8 | 10.121.4.9 |
|
|
|
|
# Configure OS
|
|
|
|
> Perform all actions on all nodes unless specified.
|
|
> Substitute IPs and hostnames appropriately.
|
|
|
|
## Configure networking
|
|
|
|
Configure networking with the nmcli method. Connect to the console via the out-of-band interface and configure the management interface.
|
|
|
|
```
|
|
# likely have NetworkManager enabled on RHEL8 based OS
|
|
systemctl list-unit-files --state=enabled | grep -i NetworkManager
|
|
|
|
# create management interface
|
|
# nmcli con add type ethernet ifname ens4f0 con-name openstack-ctlplane connection.autoconnect yes ip4 10.122.0.7/24
|
|
nmcli con add type ethernet ifname ens9f0 con-name openstack-ctlplane connection.autoconnect yes ip4 10.122.0.7/24
|
|
```
|
|
|
|
Connect via SSH to configure the bond and VLANS.
|
|
|
|
```
|
|
# create bond interface and add slave interfaces
|
|
nmcli con add type bond ifname bond0 con-name bond0 bond.options "mode=802.3ad, miimon=100, downdelay=0, updelay=0" connection.autoconnect yes ipv4.method disabled ipv6.method ignore
|
|
# nmcli con add type ethernet ifname ens2f0 master bond0
|
|
# nmcli con add type ethernet ifname ens2f1 master bond0
|
|
nmcli con add type ethernet ifname ens3f0 master bond0
|
|
nmcli con add type ethernet ifname ens3f1 master bond0
|
|
nmcli device status
|
|
|
|
# create vlan interfaces
|
|
nmcli con add type vlan ifname bond0.13 con-name ceph-public id 13 dev bond0 connection.autoconnect yes ip4 10.122.10.7/24
|
|
nmcli con add type vlan ifname bond0.15 con-name ceph-cluster id 15 dev bond0 connection.autoconnect yes ip4 10.122.14.7/24
|
|
nmcli con add type vlan ifname bond0.14 con-name openstack-storage_mgmt id 14 dev bond0 connection.autoconnect yes ip4 10.122.12.7/24
|
|
nmcli con add type vlan ifname bond0.1214 con-name openstack-external id 1214 dev bond0 connection.autoconnect yes ip4 10.121.4.7/24 gw4 10.121.4.1 ipv4.dns 144.173.6.71,1.1.1.1 ipv4.dns-search local
|
|
|
|
# check all devices are up
|
|
nmcli device status
|
|
nmcli con show
|
|
nmcli con show bond0
|
|
|
|
# check LACP settings
|
|
cat /proc/net/bonding/bond0
|
|
|
|
# remove connection profiles
|
|
nmcli con show
|
|
nmcli con del openstack-ctlplane
|
|
nmcli con del ceph-public
|
|
nmcli con del ceph-cluster
|
|
nmcli con del openstack-storage_mgmt
|
|
nmcli con del openstack-external
|
|
nmcli con del bond-slave-ens3f0
nmcli con del bond-slave-ens3f1
|
|
nmcli con del bond0
|
|
nmcli con show
|
|
nmcli device status
|
|
```
|
|
|
|
## Install useful tools and enable Podman
|
|
|
|
```sh
|
|
dnf update -y ;\
|
|
dnf install nano lvm2 chrony telnet traceroute wget tar nmap tmux bind-utils net-tools podman python3 mlocate ipmitool tmux wget yum-utils -y ;\
|
|
systemctl enable podman ;\
|
|
systemctl start podman
|
|
```
|
|
|
|
## Setup hostnames
|
|
|
|
Cephadm install tool specific setup: Ceph prefers to talk to its peers using IPs (FQDNs require more setup and are not recommended in the documentation).
|
|
|
|
```sh
|
|
echo "10.122.10.7 ceph1" | tee -a /etc/hosts ;\
|
|
echo "10.122.10.8 ceph2" | tee -a /etc/hosts ;\
|
|
echo "10.122.10.9 ceph3" | tee -a /etc/hosts
|
|
|
|
hostnamectl set-hostname ceph1 # this should not be an FQDN such as ceph1.local (as recommended in ceph documentation)
|
|
hostnamectl set-hostname --transient ceph1
|
|
```
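A quick check that the short names resolve locally the way cephadm expects; a minimal sketch using standard tooling:

```sh
# confirm the static (and any transient) hostname is the short form
hostnamectl status | grep -i hostname

# confirm the peer names resolve via /etc/hosts to the Ceph public network
getent hosts ceph1 ceph2 ceph3
```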
|
|
|
|
## Setup NTP
|
|
|
|
```
|
|
dnf install chrony -y
|
|
timedatectl set-timezone Europe/London
|
|
nano -cw /etc/chrony.conf
|
|
|
|
server ntp.university.ac.uk iburst
|
|
pool 2.cloudlinux.pool.ntp.org iburst
|
|
|
|
systemctl enable chronyd
|
|
systemctl start chronyd
|
|
```
|
|
|
|
## Disable annoyances
|
|
|
|
```
|
|
systemctl disable firewalld
|
|
systemctl stop firewalld
|
|
|
|
# DO NOT DISABLE SELINUX - now a requirement of Ceph, containers will not start without SELINUX enforcing
|
|
#sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/sysconfig/selinux
|
|
#getenforce
|
|
#setenforce 0
|
|
#getenforce
|
|
```
|
|
|
|
## Reboot
|
|
|
|
```sh
|
|
reboot
|
|
```
|
|
|
|
# Ceph install
|
|
|
|
## Download cephadm deployment tool
|
|
|
|
```
|
|
#curl --silent --remote-name --location https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm
|
|
curl --silent --remote-name --location https://github.com/ceph/ceph/raw/pacific/src/cephadm/cephadm
|
|
chmod +x cephadm
|
|
```
|
|
|
|
## Add the Ceph yum repo and install the cephadm tool to the system, then remove the installer.
|
|
|
|
```
|
|
# this may not be required with the pacific version of cephadm
|
|
# add rockylinux / almalinux to the accepted distributions in the installer
|
|
nano -cw cephadm
|
|
|
|
class YumDnf(Packager):
|
|
DISTRO_NAMES = {
|
|
'rocky' : ('centos', 'el'),
|
|
'almalinux': ('centos', 'el'),
|
|
'centos': ('centos', 'el'),
|
|
'rhel': ('centos', 'el'),
|
|
'scientific': ('centos', 'el'),
|
|
'fedora': ('fedora', 'fc'),
|
|
}
|
|
|
|
./cephadm add-repo --release pacific
|
|
./cephadm install
|
|
which cephadm
|
|
rm ./cephadm
|
|
```
|
|
|
|
## Bootstrap the first mon node
|
|
|
|
> This action should be performed ONLY on ceph1.
|
|
|
|
- Bootstrap the mon daemon on this node, using the mon network interface (referred to as the public network in ceph documentation).
|
|
- Bootstrap will pull the correct docker image and setup the host config files and systemd scripts (to start daemon containers).
|
|
- The /etc/ceph/ceph.conf config is populated with a unique cluster fsid and the mon0 host connection profile.
|
|
|
|
```
|
|
mkdir -p /etc/ceph
|
|
cephadm bootstrap --mon-ip 10.122.10.7 --skip-mon-network --cluster-network 10.122.14.0/24
|
|
|
|
# copy the output of the command to file
|
|
|
|
Ceph Dashboard is now available at:
|
|
|
|
URL: https://ceph1:8443/
|
|
User: admin
|
|
Password: Password0
|
|
|
|
Enabling client.admin keyring and conf on hosts with "admin" label
|
|
Enabling autotune for osd_memory_target
|
|
You can access the Ceph CLI as following in case of multi-cluster or non-default config:
|
|
|
|
sudo /usr/sbin/cephadm shell --fsid 5b99e574-4577-11ed-b70e-e43d1a63e590 -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring
|
|
|
|
Or, if you are only running a single cluster on this host:
|
|
|
|
sudo /usr/sbin/cephadm shell
|
|
|
|
cat /etc/ceph/ceph.conf
|
|
|
|
# minimal ceph.conf for 5b99e574-4577-11ed-b70e-e43d1a63e590
|
|
[global]
|
|
fsid = 5b99e574-4577-11ed-b70e-e43d1a63e590
|
|
mon_host = [v2:10.122.10.7:3300/0,v1:10.122.10.7:6789/0]
|
|
```
|
|
|
|
## Install the ceph cli on the first mon node
|
|
|
|
> This action should be performed on ceph1.
|
|
|
|
The CLI can also be used via a container shell without installation; the cephadm installation method configures the CLI tool to target the container daemons.
|
|
|
|
```
|
|
cephadm install ceph-common
|
|
ceph -v
|
|
|
|
ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
|
|
|
|
# ceph status
|
|
ceph -s
|
|
|
|
cluster:
|
|
id: 5b99e574-4577-11ed-b70e-e43d1a63e590
|
|
health: HEALTH_WARN
|
|
OSD count 0 < osd_pool_default_size 3
|
|
|
|
services:
|
|
mon: 1 daemons, quorum ceph1 (age 2m)
|
|
mgr: ceph1.virprg(active, since 46s)
|
|
osd: 0 osds: 0 up, 0 in
|
|
|
|
data:
|
|
pools: 0 pools, 0 pgs
|
|
objects: 0 objects, 0 B
|
|
usage: 0 B used, 0 B / 0 B avail
|
|
pgs:
|
|
```
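If you prefer not to install ceph-common on a host, the same checks can be run through the cephadm container shell instead; a minimal sketch:

```sh
# run a one-off command inside the cephadm toolbox container
cephadm shell -- ceph -s

# or open an interactive shell with the cluster config and admin keyring mounted
cephadm shell
ceph -s
exit
```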
|
|
|
|
## Push ceph ssh pub key to other ceph nodes
|
|
|
|
> This action should be performed on ceph1.
|
|
|
|
```
|
|
ceph cephadm get-pub-key > ~/ceph.pub
|
|
for i in {2..3};do ssh-copy-id -f -i ~/ceph.pub root@ceph$i;done
|
|
```
|
|
|
|
Test connectivity of the ceph key.
|
|
|
|
```
|
|
ceph config-key get mgr/cephadm/ssh_identity_key > ~/ceph.pvt
|
|
chmod 0600 ~/ceph.pvt
|
|
ssh -i ceph.pvt root@ceph2
|
|
ssh -i ceph.pvt root@ceph3
|
|
```
|
|
|
|
## Add more mon nodes
|
|
|
|
> This action should be performed on ceph1.
|
|
> The `_admin` label populates the /etc/ceph config files to allow CLI usage on each host.
|
|
|
|
```
|
|
ceph orch host add ceph2 10.122.10.8 --labels _admin
|
|
ceph orch host add ceph3 10.122.10.9 --labels _admin
|
|
```
|
|
|
|
## Install the ceph cli on the remaining nodes
|
|
|
|
```
|
|
ssh -i ceph.pvt root@ceph2
|
|
cephadm install ceph-common
|
|
ceph -s
|
|
exit
|
|
|
|
ssh -i ceph.pvt root@ceph3
|
|
cephadm install ceph-common
|
|
ceph -s
|
|
exit
|
|
```
|
|
|
|
## Set the operating networks (the public and cluster networks are separate VLANs on the same bond)
|
|
|
|
> This action should be performed on ceph1.
|
|
|
|
```
|
|
ceph config set global public_network 10.122.10.0/24
|
|
ceph config set global cluster_network 10.122.14.0/24
|
|
ceph config dump
|
|
```
|
|
|
|
## Add all labels to the node
|
|
|
|
> This action should be performed on ceph1.
|
|
|
|
These are arbitrary label values to assist with service placement; however, there are special labels with built-in functionality such as '_admin'.
|
|
|
|
> https://docs.ceph.com/en/latest/cephadm/host-management/
|
|
|
|
```
|
|
ceph orch host label add ceph1 mon ;\
|
|
ceph orch host label add ceph1 osd ;\
|
|
ceph orch host label add ceph1 mgr ;\
|
|
ceph orch host label add ceph1 mds ;\
|
|
ceph orch host label add ceph1 rgw ;\
|
|
ceph orch host label add ceph2 mon ;\
|
|
ceph orch host label add ceph2 osd ;\
|
|
ceph orch host label add ceph2 mgr ;\
|
|
ceph orch host label add ceph2 mds ;\
|
|
ceph orch host label add ceph2 rgw ;\
|
|
ceph orch host label add ceph3 mon ;\
|
|
ceph orch host label add ceph3 osd ;\
|
|
ceph orch host label add ceph3 mgr ;\
|
|
ceph orch host label add ceph3 mds ;\
|
|
ceph orch host label add ceph3 rgw ;\
|
|
ceph orch host ls
|
|
|
|
HOST ADDR LABELS STATUS
|
|
ceph1 10.122.10.7 _admin mon osd mgr mds rgw
|
|
ceph2 10.122.10.8 _admin mon osd mgr mds rgw
|
|
ceph3 10.122.10.9 _admin mon osd mgr mds rgw
|
|
3 hosts in cluster
|
|
```
|
|
|
|
## Deploy core daemons to hosts
|
|
|
|
> This action should be performed on ceph1.
|
|
> More daemons will be applied as they are added.
|
|
> https://docs.ceph.com/en/latest/cephadm/services/#orchestrator-cli-placement-spec
|
|
|
|
```
|
|
#ceph orch apply mon --placement="label:mon" --dry-run
|
|
ceph orch apply mon --placement="label:mon"
|
|
ceph orch apply mgr --placement="label:mgr"
|
|
ceph orch ls # keep checking until all services are up, should be <1 minute
|
|
|
|
NAME PORTS RUNNING REFRESHED AGE PLACEMENT
|
|
alertmanager ?:9093,9094 1/1 25s ago 36m count:1
|
|
crash 3/3 111s ago 36m *
|
|
grafana ?:3000 1/1 25s ago 36m count:1
|
|
mgr 3/3 111s ago 43s label:mgr
|
|
mon 3/3 111s ago 50s label:mon
|
|
node-exporter ?:9100 3/3 111s ago 36m *
|
|
prometheus ?:9095 1/1 25s ago 36m count:1
|
|
```
|
|
|
|
## Setup the mgr dashboard to listen on a specific IP (the only routable range in this case)
|
|
|
|
> This action should be performed on ceph1.
|
|
> https://docs.ceph.com/en/latest/mgr/dashboard/
|
|
|
|
When adding multiple dashboards only one node will host the active dashboard and the others will be in standby; should you connect to another host at https://host:8443 you will be redirected to the active dashboard node.
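To confirm which node currently holds the active dashboard, the active mgr can be queried directly; a minimal sketch (ceph2 here stands in for whichever node is currently a standby):

```sh
# show the active mgr (and therefore the active dashboard node) plus standbys
ceph mgr stat

# querying a standby node should answer with a redirect to the active dashboard, as described above
curl -ksI https://ceph2:8443/ | head -n 3
```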
|
|
|
|
```sh
|
|
# the dashboard is not run on the public_network but on the routable network; the Openstack dashboard will also live here
|
|
ceph config set mgr mgr/dashboard/ceph1/server_addr 10.121.4.7 ;\
|
|
ceph config set mgr mgr/dashboard/ceph2/server_addr 10.121.4.8 ;\
|
|
ceph config set mgr mgr/dashboard/ceph3/server_addr 10.121.4.9
|
|
|
|
# stop/start ceph
|
|
systemctl stop ceph.target;sleep 5;systemctl start ceph.target
|
|
|
|
# check service endpoints, likely the mgr service is running on ceph1 with ceph2/3 acting as standby
|
|
ceph mgr services
|
|
|
|
{
|
|
"dashboard": "https://10.122.10.7:8443/",
|
|
"prometheus": "http://10.122.10.7:9283/"
|
|
}
|
|
|
|
# the dashboard seems to listen on any interface
|
|
ss -taln | grep 8443
|
|
|
|
LISTEN 0 5 *:8443 *:*
|
|
|
|
# config confirms dashboard listening address
|
|
ceph config dump | grep "mgr/dashboard/ceph1/server_addr"
|
|
|
|
mgr advanced mgr/dashboard/ceph1/server_addr 10.121.4.7
|
|
```
|
|
|
|
Reset dashboard admin user password.
|
|
|
|
```
|
|
ceph dashboard ac-user-show
|
|
["admin"]
|
|
|
|
echo 'Password0' > password.txt
|
|
ceph dashboard ac-user-set-password admin -i password.txt
|
|
rm -f password.txt
|
|
```
|
|
|
|
The socket listing shows Grafana is also listening on ceph1.
|
|
|
|
> https://ceph1:8443/ Dashboard
|
|
> https://ceph1:3000/ Grafana
|
|
> http://ceph1:9283/ Prometheus
|
|
|
|
## Ceph OSD
|
|
|
|
### Add OSD
|
|
|
|
> The drive-groups method is a new way to specify which disks are to be made OSDs (types - data, db, wal); you can select disks by cluster node, by path, by serial number, by model or by size - this is useful for large estates and very fast.
|
|
> https://docs.ceph.com/en/latest/cephadm/services/osd/#drivegroups
|
|
> https://docs.ceph.com/en/pacific/rados/configuration/bluestore-config-ref/
|
|
> https://docs.ceph.com/en/octopus/cephadm/drivegroups
|
|
|
|
These instructions are fairly new but work with OSDs nested on LVM volumes as well as whole disks, which will probably be the standard in future.
|
|
|
|
- Perform any disk prep if required
|
|
- Enter container shell.
|
|
- Seed keyring with OSD credential.
|
|
- Prepare OSD (import into mon map with keys etc).
|
|
- Signal to the host to create OSD daemon containers.
|
|
|
|
For the production cluster build, each node will create a logical volume on each of the 8 spinning disks; the SSD will be carved into 8 logical volumes, with each volume acting as the wal/db device for one spinning disk.
|
|
|
|
Create the logical volumes on each node:
|
|
|
|
```
|
|
# find OSD disks
|
|
lsblk
|
|
|
|
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
|
|
sda 8:0 0 223.5G 0 disk
|
|
├─sda1 8:1 0 600M 0 part /boot/efi
|
|
├─sda2 8:2 0 1G 0 part /boot
|
|
└─sda3 8:3 0 221.9G 0 part
|
|
├─rl-root 253:0 0 217.9G 0 lvm /
|
|
└─rl-swap 253:1 0 4G 0 lvm [SWAP]
|
|
sdb 8:16 0 1.5T 0 disk
|
|
sdc 8:32 0 12.8T 0 disk
|
|
sdd 8:48 0 12.8T 0 disk
|
|
sde 8:64 0 12.8T 0 disk
|
|
sdf 8:80 0 12.8T 0 disk
|
|
sdg 8:96 0 12.8T 0 disk
|
|
sdh 8:112 0 12.8T 0 disk
|
|
sdi 8:128 0 12.8T 0 disk
|
|
sdj 8:144 0 12.8T 0 disk
|
|
|
|
# create volume groups on each disk
|
|
vgcreate ceph-block-0 /dev/sdc ;\
|
|
vgcreate ceph-block-1 /dev/sdd ;\
|
|
vgcreate ceph-block-2 /dev/sde ;\
|
|
vgcreate ceph-block-3 /dev/sdf ;\
|
|
vgcreate ceph-block-4 /dev/sdg ;\
|
|
vgcreate ceph-block-5 /dev/sdh ;\
|
|
vgcreate ceph-block-6 /dev/sdi ;\
|
|
vgcreate ceph-block-7 /dev/sdj
|
|
|
|
# create logical volumes on each volume group
|
|
lvcreate -l 100%FREE -n block-0 ceph-block-0 ;\
|
|
lvcreate -l 100%FREE -n block-1 ceph-block-1 ;\
|
|
lvcreate -l 100%FREE -n block-2 ceph-block-2 ;\
|
|
lvcreate -l 100%FREE -n block-3 ceph-block-3 ;\
|
|
lvcreate -l 100%FREE -n block-4 ceph-block-4 ;\
|
|
lvcreate -l 100%FREE -n block-5 ceph-block-5 ;\
|
|
lvcreate -l 100%FREE -n block-6 ceph-block-6 ;\
|
|
lvcreate -l 100%FREE -n block-7 ceph-block-7
|
|
|
|
# create volume groups on the SSD disk
|
|
vgcreate ceph-db-0 /dev/sdb
|
|
|
|
# divide the SSD disk into 8 logical volumes to provide a DB device
|
|
lvcreate -L 180GB -n db-0 ceph-db-0 ;\
|
|
lvcreate -L 180GB -n db-1 ceph-db-0 ;\
|
|
lvcreate -L 180GB -n db-2 ceph-db-0 ;\
|
|
lvcreate -L 180GB -n db-3 ceph-db-0 ;\
|
|
lvcreate -L 180GB -n db-4 ceph-db-0 ;\
|
|
lvcreate -L 180GB -n db-5 ceph-db-0 ;\
|
|
lvcreate -L 180GB -n db-6 ceph-db-0 ;\
|
|
lvcreate -L 180GB -n db-7 ceph-db-0
|
|
```
|
|
|
|
Write the OSD service spec file and apply it; this should only be run on a single _admin node, ceph1.
|
|
|
|
```
|
|
# enter into a container with the toolchain and keys
|
|
cephadm shell -m /var/lib/ceph
|
|
|
|
# pull credentials from the database to a file for the ceph-volume tool
|
|
ceph auth get-or-create client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring
|
|
|
|
# If there is an issue ingesting a disk as an OSD, all partition structures can be destroyed with the following commands
|
|
#ceph-volume lvm zap /dev/sdb
|
|
#sgdisk --zap-all /dev/sdb
|
|
# there are a few methods to rescan disks and have the kernel re-address them; often a reboot is the quickest way to get OSDs recognised after scheduling for ingestion
|
|
# exit
|
|
# reboot
|
|
|
|
# Example methods of provisioning disk as OSD via the cli, **use the service spec yaml method**
|
|
|
|
## for LVM
|
|
#ceph-volume lvm prepare --data /dev/almalinux/osd0 --no-systemd
|
|
#ceph cephadm osd activate ceph1 # magic command that creates the systemd unit file(s) on the host to bring up an OSD daemon container
|
|
#ceph-volume lvm list
|
|
|
|
## for whole disk, manual method, this is probably a legacy method but is reliable
|
|
#ceph orch daemon add osd ceph1:/dev/sda
|
|
#ceph orch daemon add osd ceph1:/dev/sdb
|
|
|
|
# **Preferred method of provisioning using a service specification**
|
|
|
|
## service spec method
|
|
## for whole disk or LVM, new drive-groups method with a single configuration and one-shot command
|
|
# only needs to be performed on one node, ceph1
|
|
# you can perform this on the native operating system, this will help put the osd_spec.yml file in source control
|
|
# for LVM partitions on whole disks at the University this was done in the cephadm shell (cephadm shell -m /var/lib/ceph) as this is where the ceph orch command seemed to work
|
|
|
|
# for use of any kind of discovery based auto selection of the disk you can query a disk to get traits, this should work on whole disk and LVMs alike
|
|
# ceph-volume inventory /dev/ceph-block-0/block-0
|
|
#
|
|
# ====== Device report /dev/ceph-db-0/db-0 ======
|
|
#
|
|
# path /dev/ceph-db-0/db-0
|
|
# lsm data {}
|
|
# available False
|
|
# rejected reasons Device type is not acceptable. It should be raw device or partition
|
|
# device id
|
|
# --- Logical Volume ---
|
|
# name db-0
|
|
# comment not used by ceph
|
|
|
|
# create the service spec file; it will include multiple yaml documents delimited by ---
|
|
vi osd_spec.yml
|
|
|
|
---
|
|
service_type: osd
|
|
service_id: block-0
|
|
placement:
|
|
hosts:
|
|
- ceph1
|
|
- ceph2
|
|
- ceph3
|
|
spec:
|
|
data_devices:
|
|
paths:
|
|
- /dev/ceph-block-0/block-0
|
|
db_devices:
|
|
paths:
|
|
- /dev/ceph-db-0/db-0
|
|
---
|
|
service_type: osd
|
|
service_id: block-1
|
|
placement:
|
|
hosts:
|
|
- ceph1
|
|
- ceph2
|
|
- ceph3
|
|
spec:
|
|
data_devices:
|
|
paths:
|
|
- /dev/ceph-block-1/block-1
|
|
db_devices:
|
|
paths:
|
|
- /dev/ceph-db-0/db-1
|
|
---
|
|
service_type: osd
|
|
service_id: block-2
|
|
placement:
|
|
hosts:
|
|
- ceph1
|
|
- ceph2
|
|
- ceph3
|
|
spec:
|
|
data_devices:
|
|
paths:
|
|
- /dev/ceph-block-2/block-2
|
|
db_devices:
|
|
paths:
|
|
- /dev/ceph-db-0/db-2
|
|
---
|
|
service_type: osd
|
|
service_id: block-3
|
|
placement:
|
|
hosts:
|
|
- ceph1
|
|
- ceph2
|
|
- ceph3
|
|
spec:
|
|
data_devices:
|
|
paths:
|
|
- /dev/ceph-block-3/block-3
|
|
db_devices:
|
|
paths:
|
|
- /dev/ceph-db-0/db-3
|
|
---
|
|
service_type: osd
|
|
service_id: block-4
|
|
placement:
|
|
hosts:
|
|
- ceph1
|
|
- ceph2
|
|
- ceph3
|
|
spec:
|
|
data_devices:
|
|
paths:
|
|
- /dev/ceph-block-4/block-4
|
|
db_devices:
|
|
paths:
|
|
- /dev/ceph-db-0/db-4
|
|
---
|
|
service_type: osd
|
|
service_id: block-5
|
|
placement:
|
|
hosts:
|
|
- ceph1
|
|
- ceph2
|
|
- ceph3
|
|
spec:
|
|
data_devices:
|
|
paths:
|
|
- /dev/ceph-block-5/block-5
|
|
db_devices:
|
|
paths:
|
|
- /dev/ceph-db-0/db-5
|
|
---
|
|
service_type: osd
|
|
service_id: block-6
|
|
placement:
|
|
hosts:
|
|
- ceph1
|
|
- ceph2
|
|
- ceph3
|
|
spec:
|
|
data_devices:
|
|
paths:
|
|
- /dev/ceph-block-6/block-6
|
|
db_devices:
|
|
paths:
|
|
- /dev/ceph-db-0/db-6
|
|
---
|
|
service_type: osd
|
|
service_id: block-7
|
|
placement:
|
|
hosts:
|
|
- ceph1
|
|
- ceph2
|
|
- ceph3
|
|
spec:
|
|
data_devices:
|
|
paths:
|
|
- /dev/ceph-block-7/block-7
|
|
db_devices:
|
|
paths:
|
|
- /dev/ceph-db-0/db-7
|
|
|
|
ceph orch apply -i osd_spec.yml # creates the systemd unit file(s) on the host to bring up OSD daemon containers (1 container per OSD)
|
|
|
|
# exit the container
|
|
|
|
# wait whilst OSDs are created, you will see a container per OSD
|
|
podman ps -a
|
|
ceph status
|
|
|
|
cluster:
|
|
id: 5b99e574-4577-11ed-b70e-e43d1a63e590
|
|
health: HEALTH_OK
|
|
|
|
services:
|
|
mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 75m)
|
|
mgr: ceph1.fgnquq(active, since 75m), standbys: ceph2.whhrir, ceph3.mxipmg
|
|
osd: 24 osds: 24 up (since 2m), 24 in (since 3m)
|
|
|
|
data:
|
|
pools: 1 pools, 1 pgs
|
|
objects: 0 objects, 0 B
|
|
usage: 4.2 TiB used, 306 TiB / 310 TiB avail
|
|
pgs: 1 active+clean
|
|
|
|
# check OSD tree
|
|
ceph osd df tree
|
|
|
|
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
|
|
-1 309.82068 - 310 TiB 4.2 TiB 19 MiB 0 B 348 MiB 306 TiB 1.36 1.00 - root default
|
|
-3 103.27356 - 103 TiB 1.4 TiB 6.3 MiB 0 B 116 MiB 102 TiB 1.36 1.00 - host ceph1
|
|
0 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 15 MiB 13 TiB 1.36 1.00 0 up osd.0
|
|
4 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 15 MiB 13 TiB 1.36 1.00 0 up osd.4
|
|
8 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 15 MiB 13 TiB 1.36 1.00 0 up osd.8
|
|
11 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 15 MiB 13 TiB 1.36 1.00 0 up osd.11
|
|
12 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 14 MiB 13 TiB 1.36 1.00 0 up osd.12
|
|
16 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 14 MiB 13 TiB 1.36 1.00 0 up osd.16
|
|
18 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 14 MiB 13 TiB 1.36 1.00 0 up osd.18
|
|
23 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 14 MiB 13 TiB 1.36 1.00 1 up osd.23
|
|
-5 103.27356 - 103 TiB 1.4 TiB 6.3 MiB 0 B 116 MiB 102 TiB 1.36 1.00 - host ceph2
|
|
1 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 15 MiB 13 TiB 1.36 1.00 1 up osd.1
|
|
3 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 15 MiB 13 TiB 1.36 1.00 0 up osd.3
|
|
6 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 15 MiB 13 TiB 1.36 1.00 0 up osd.6
|
|
9 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 15 MiB 13 TiB 1.36 1.00 0 up osd.9
|
|
14 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 14 MiB 13 TiB 1.36 1.00 0 up osd.14
|
|
15 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 14 MiB 13 TiB 1.36 1.00 0 up osd.15
|
|
19 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 14 MiB 13 TiB 1.36 1.00 0 up osd.19
|
|
22 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 14 MiB 13 TiB 1.36 1.00 0 up osd.22
|
|
-7 103.27356 - 103 TiB 1.4 TiB 6.3 MiB 0 B 116 MiB 102 TiB 1.36 1.00 - host ceph3
|
|
2 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 15 MiB 13 TiB 1.36 1.00 0 up osd.2
|
|
5 hdd 12.90919 1.00000 13 TiB 180 GiB 804 KiB 0 B 15 MiB 13 TiB 1.36 1.00 0 up osd.5
|
|
7 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 15 MiB 13 TiB 1.36 1.00 0 up osd.7
|
|
10 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 15 MiB 13 TiB 1.36 1.00 0 up osd.10
|
|
13 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 14 MiB 13 TiB 1.36 1.00 0 up osd.13
|
|
17 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 14 MiB 13 TiB 1.36 1.00 0 up osd.17
|
|
20 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 14 MiB 13 TiB 1.36 1.00 1 up osd.20
|
|
21 hdd 12.90919 1.00000 13 TiB 180 GiB 808 KiB 0 B 14 MiB 13 TiB 1.36 1.00 0 up osd.21
|
|
TOTAL 310 TiB 4.2 TiB 19 MiB 0 B 348 MiB 306 TiB 1.36
|
|
MIN/MAX VAR: 1.00/1.00 STDDEV: 0
|
|
```
|
|
|
|
Deleting OSDs: at least one OSD should be left for the metrics/config pools to function. Removing all OSDs will tank an install and is only useful when removing a Ceph cluster entirely; usually you would rebuild fresh.
|
|
|
|
```
|
|
# remove all OSDs, this is only useful if you intend to destroy the ceph cluster - DANGEROUS
|
|
# doesn't really work when all OSDs are removed as key operating pools are destroyed, not just degraded
|
|
|
|
#!/bin/bash
|
|
for i in {0..23}
|
|
do
|
|
ceph osd out osd.$i
|
|
ceph osd down osd.$i
|
|
ceph osd rm osd.$i
|
|
ceph osd crush rm osd.$i
|
|
ceph auth del osd.$i
|
|
ceph osd destroy $i --yes-i-really-mean-it
|
|
ceph orch daemon rm osd.$i --force
|
|
ceph osd df tree
|
|
done
|
|
ceph osd crush rm ceph1
|
|
ceph osd crush rm ceph2
|
|
ceph osd crush rm ceph3
|
|
```
|
|
|
|
## Enable autotune memory usage on OSD nodes
|
|
|
|
> This action should be performed on ceph1.
|
|
|
|
```
|
|
ceph config set osd osd_memory_target_autotune true
|
|
ceph config get osd osd_memory_target_autotune
|
|
```
|
|
|
|
## Enable placement group autoscaling for any pool subsequently added
|
|
|
|
> This action should be performed on ceph1.
|
|
|
|
```
|
|
ceph config set global osd_pool_default_pg_autoscale_mode on
|
|
ceph osd pool autoscale-status
|
|
```
|
|
|
|
# Erasure coding
|
|
|
|
## Understanding EC
|
|
|
|
The ruleset for EC is not so clear, especially for small clusters; the following explanation/rules should be followed for a small 3 node Ceph cluster. In reality you have only one available scheme: K=2, M=1.
|
|
|
|
- K - the number of chunks the original data is divided into
- M - the extra codes (basically parity) stored alongside the data
- N - the number of chunks created for each piece of data, K+M
- Crush failure domain - can be OSD, RACK, HOST (and a few more if listed in the crushmap, such as PDU, DATACENTRE); this dictates the dispersal of the M chunks (and presumably the K chunks also, to allow for larger schemes such as RACK).
- Failure domains - OSD seems to be only for testing; HOST is the most typical use case; RACK seems very sensible but requires many nodes.
- **What you won't find documented clearly is that there need to be at least as many hosts as K+M when using the HOST scheme for resiliency.**
- A 3 node cluster can only support K=2,M=1.
- In a RACK failure domain, say there are 4 racks (most likely with an equal number of nodes and OSDs), you could have K=3,M=1, allowing for 1 total rack failure.
- EC originally supported RGW object storage only; RBD pools are now supported (using ec_overwrites) but the pool metadata must still reside on a replicated pool. Openstack has an undocumented setting to use the metadata/data pool pair.
|
|
|
|
3 node configuration, OSD vs HOST, illustrating the failure domain scheme differences:
|
|
|
|
- Using K=2,M=1 and an OSD failure domain could mean host1 gets K=1,M=1 and host2 gets K=1. If host1 goes down you won't be able to recreate the data.
- Using K=2,M=1 and a HOST failure domain would mean host1 gets K=1, host2 gets K=1, host3 gets M=1 - each node gets a K or M, data and parity are dispersed equally, and 1 full node failure is tolerated (the crush rule check sketched below shows which failure domain a pool is actually using).
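The failure domain actually in force can be confirmed from the crush rule attached to the EC pool once it exists; a minimal sketch, re-using the `volumes_data` pool created later in this document:

```sh
# find the crush rule used by the EC data pool
ceph osd pool get volumes_data crush_rule

# dump the rules and check the chooseleaf step type is "host" for a k=2,m=1 profile
ceph osd crush rule ls
ceph osd crush rule dump
```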
|
|
|
|
Ceph supports many different K,M schemes; this doesn't mean they all work or offer the protection you want, and in some cases pool creation will stall where the scheme is inadvisable.

It is recommended that you never use more than 80% of the storage capacity: above 80% there are performance penalties as data is shuffled about, and at 100% the cluster goes read-only and will probably damage in-flight data, as in any filesystem.
|
|
|
|
Redhat state that where you would use K=4,M=2 you may use K=8,M=4 for greater resiliency; they do not state that 12 nodes would realistically be required for this in a HOST failure domain.

K=4,M=2 on a 12 node cluster in HOST failure domain mode would work just fine and would use less CPU/RAM when writing the data chunks to disk, but a client may get less read performance on a busy cluster as it would only pull from 50% of the cluster nodes.

Where K+M is odd and the node count is even (or vice versa), data would not be equally distributed across the cluster; with large data files such as VM images the disparity may be noticeable even after automatic re-balancing.

The same applies to plain replication: say there are 3 nodes and 2-way replication set in the crush map, large files may be written to two nodes and fill them to capacity, while considerable free space is shown as available yet is effectively unusable - re-balancing will not help.
|
|
|
|
Redhat supports the following schemes with the jerasure EC plugin (this is the default algorithm):
|
|
|
|
- K=8,M=3 (minimum 11 nodes using HOST failure domain)
|
|
- K=8,M=4 (minimum 12 nodes using HOST failure domain)
|
|
- K=4,M=2 (minimum 6 nodes using HOST failure domain)
|
|
|
|
## EC usable space
|
|
|
|
### Example 1
|
|
|
|
For illustration, each node has 4 disks (OSDs) of 12TB, i.e. 48TB raw disk per node; take the following examples:
|
|
|
|
- minimum 3 nodes K=2,M=1 - 144TB raw disk - (12 OSD * (2 K / ( 2 K + 1 M)) * 12TB OSD Size * 0.8 (80% capacity) ) - 76TB usable disk VS 3way replication ((144TB / 3) * 0.8) 38.4TB
|
|
- minimum 4 nodes K=3,M=1 - 192TB raw disk - (16 OSD * (3 K / (3 K + 1 M)) * 12TB OSD Size * 0.8) - 115TB usable disk VS 3way replication ((192TB / 3) * 0.8) 51.2TB
|
|
- minimum 12 nodes K=9,M=3 - 576TB raw disk - (48 OSD * (9 K / (9 K + 3 M)) * 12TB OSD Size * 0.8) - 345TB usable disk VS 3way replication ((576TB / 3) * 0.8) 153.6TB
|
|
|
|
### University Openstack
|
|
|
|
3 nodes, 8 disks per node (excluding SSD db/wal), 14TB disks thus 336TB raw disk.
|
|
All possible storage schemes only allow for 1 failed HOST.
|
|
|
|
- In a 3 way replication we have 336/3 = 112 * 0.8 = 89.6TB usable space
|
|
- In a 2 way replication (more prone to bitrot) we have 336/2 = 168 * 0.8 = 134.4TB usable space
|
|
- In an EC scheme of K=2,M=1 we have 24 * (2 / (2+1)) * 14 * 0.8 = 179TB usable space (see the quick check below)
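The same arithmetic as a quick sanity check; a minimal sketch using awk, with the raw capacity and 80% headroom assumptions stated above:

```sh
awk 'BEGIN {
  raw = 24 * 14                                   # 24 OSDs x 14TB disks = 336TB raw
  printf "3-way replication: %.1f TB usable\n", raw / 3 * 0.8
  printf "2-way replication: %.1f TB usable\n", raw / 2 * 0.8
  printf "EC k=2,m=1:        %.1f TB usable\n", raw * (2 / (2 + 1)) * 0.8
}'
```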
|
|
|
|
# Openstack RBD storage
|
|
|
|
> CephFS/RGW are not being used on this cluster, it is purely to be used for VM image storage.
|
|
> For further Openstack CephFS/RGW integration see the OCF LAB notes, these are a much more involved Openstack deployment.
|
|
|
|
- For RHOSP 16 the controller role must contain all of the ceph services for use with an Openstack provisioned or externally provisioned ceph cluster.
|
|
- The Roles created for the University deployment already contain the Ceph services.
|
|
|
|
## Undercloud Ceph packages
|
|
|
|
Ensure that your undercloud has the right version of `ceph-ansible` before any deployment.
|
|
|
|
Get Ceph packages.
|
|
|
|
> https://access.redhat.com/solutions/2045583
|
|
|
|
- Redhat Ceph 4.1 = Nautilus release
|
|
- Redhat Ceph 5.1 = Pacific release
|
|
|
|
```sh
|
|
sudo subscription-manager repos | grep -i ceph
|
|
|
|
# Nautilus
|
|
sudo subscription-manager repos --enable=rhceph-4-tools-for-rhel-8-x86_64-rpms
|
|
|
|
# Pacific (if you are using external Ceph from the opensource repos you will likely be using this)
|
|
#sudo dnf remove -y ceph-ansible
|
|
#sudo subscription-manager repos --disable=rhceph-4-tools-for-rhel-8-x86_64-rpms
|
|
sudo subscription-manager repos --enable=rhceph-5-tools-for-rhel-8-x86_64-rpms
|
|
|
|
# install
|
|
sudo dnf info ceph-ansible
|
|
sudo dnf install -y ceph-ansible
|
|
```
|
|
|
|
## Create Openstack pools - University uses EC pools, skip to the next section
|
|
|
|
Listed are the recommended PG allocations using Redhat defaults; this isn't very tuned and assumes 100 PGs per OSD on a 3 node cluster with 9 disks. Opensource Ceph now allows up to 250 PGs per OSD.
|
|
|
|
As PG autoscaling is enabled, and as this release is later than Nautilus, we can avoid specifying PGs; each pool will initially be allocated 32 PGs and scale from there.
|
|
|
|
RBD pools.
|
|
|
|
```sh
|
|
# Storage for OpenStack Block Storage (cinder)
|
|
#ceph osd pool create volumes 2048
|
|
ceph osd pool create volumes
|
|
|
|
# Storage for OpenStack Image Storage (glance)
|
|
#ceph osd pool create images 128
|
|
ceph osd pool create images
|
|
|
|
# Storage for instances
|
|
#ceph osd pool create vms 256
|
|
ceph osd pool create vms
|
|
|
|
# Storage for OpenStack Block Storage Backup (cinder-backup)
|
|
#ceph osd pool create backups 512
|
|
ceph osd pool create backups
|
|
|
|
# Storage for OpenStack Telemetry Metrics (gnocchi)
|
|
#ceph osd pool create metrics 128
|
|
ceph osd pool create metrics
|
|
|
|
# Check pools
|
|
ceph osd pool ls
|
|
|
|
device_health_metrics
|
|
volumes
|
|
images
|
|
vms
|
|
backups
|
|
metrics
|
|
```
|
|
|
|
## Create Erasure Coded Openstack pools
|
|
|
|
1. create EC profile (https://docs.ceph.com/en/latest/rados/operations/erasure-code/)
|
|
2. create metadata pools with normal 3 way replication (default replication rule in the crushmap)
|
|
3. create EC pools K=2,M=1,failure domain HOST
|
|
|
|
metadata pool - replicated pool
|
|
data pool - EC pool (with ec_overwrites)
|
|
|
|
| Metadata Pool | Data Pool | Usage |
|
|
| --- | --- | --- |
|
|
| volumes | volumes_data | Storage for OpenStack Block Storage (cinder) |
|
|
| images | images_data | Storage for OpenStack Image Storage (glance) |
|
|
| vms | vms_data | Storage for VM/Instance disk |
|
|
| backups | backups_data | Storage for OpenStack Block Storage Backup (cinder-backup) |
|
|
| metrics | metrics_data | Storage for OpenStack Telemetry Metrics (gnocchi) |
|
|
|
|
Create pool example:
|
|
|
|
```sh
|
|
# if you need to remove a pool, remember to change this back to false after deletion
|
|
#ceph config set mon mon_allow_pool_delete true
|
|
|
|
# create new erasure code profile (default will exist)
|
|
ceph osd erasure-code-profile set ec-21-profile k=2 m=1 crush-failure-domain=host
|
|
ceph osd erasure-code-profile ls
|
|
ceph osd erasure-code-profile get ec-21-profile
|
|
|
|
crush-device-class=
|
|
crush-failure-domain=host
|
|
crush-root=default
|
|
jerasure-per-chunk-alignment=false
|
|
k=2
|
|
m=1
|
|
plugin=jerasure
|
|
technique=reed_sol_van
|
|
w=8
|
|
|
|
# delete an EC profile
|
|
#ceph osd erasure-code-profile rm ec-21-profile
|
|
|
|
# create the pool; it will host metadata only once rbd_default_data_pool is set below. By default the crushmap rule is replicated; the parameter is included to make explicit that metadata must be replicated, not erasure coded
|
|
ceph osd pool create volumes replicated
|
|
ceph osd pool application enable volumes rbd
|
|
ceph osd pool application get volumes
|
|
|
|
# create erasure code enabled data pool
|
|
ceph osd pool create volumes_data erasure ec-21-profile
|
|
ceph osd pool set volumes_data allow_ec_overwrites true # this must be set for RBD pools so that constantly-open disk images can be updated in place
|
|
ceph osd pool application enable volumes_data rbd # Openstack will usually ensure the pool is RBD application enabled; when specifying a data pool we must explicitly set the usage/application mode
|
|
ceph osd pool application get volumes_data
|
|
|
|
# set an EC data pool for the replicated pool; 'volumes' will subsequently host only metadata - this is a magic command that was not documented until 2022. Typically in non-RHOSP deployments each service has its own client.<service> user and EC data pool override
|
|
rbd config pool set volumes rbd_default_data_pool volumes_data
|
|
|
|
# If using CephFS with manila the pool creation is the same; however, dictating usage of the data pool is a little simpler and is specified at volume creation. allow_ec_overwrites must also be set for CephFS
|
|
#ceph fs new cephfs cephfs_metadata cephfs_data force
|
|
|
|
# Check pools; notice the 3-way replicated pools could consume at most 97TiB where the EC pools could consume 193TiB, around 179TB usable at max performance according to the EC calculation explained earlier in this document
|
|
ceph osd pool ls
|
|
|
|
device_health_metrics
|
|
volumes
|
|
images
|
|
vms
|
|
backups
|
|
metrics
|
|
|
|
ceph df
|
|
--- RAW STORAGE ---
|
|
CLASS SIZE AVAIL USED RAW USED %RAW USED
|
|
hdd 310 TiB 306 TiB 4.2 TiB 4.2 TiB 1.36
|
|
TOTAL 310 TiB 306 TiB 4.2 TiB 4.2 TiB 1.36
|
|
|
|
--- POOLS ---
|
|
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
|
|
device_health_metrics 1 1 0 B 30 0 B 0 97 TiB
|
|
volumes_data 7 32 0 B 0 0 B 0 193 TiB
|
|
volumes 8 32 0 B 1 0 B 0 97 TiB
|
|
images 9 32 0 B 1 0 B 0 97 TiB
|
|
vms 10 32 0 B 1 0 B 0 97 TiB
|
|
backups 11 32 0 B 1 0 B 0 97 TiB
|
|
metrics 12 32 0 B 1 0 B 0 97 TiB
|
|
images_data 13 32 0 B 0 0 B 0 193 TiB
|
|
vms_data 14 32 0 B 0 0 B 0 193 TiB
|
|
backups_data 15 32 0 B 0 0 B 0 193 TiB
|
|
metrics_data 16 32 0 B 0 0 B 0 193 TiB
|
|
```
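A quick way to confirm the EC wiring took effect; a minimal sketch re-using the pool names above:

```sh
# the replicated pool should point new RBD data at the EC pool
rbd config pool get volumes rbd_default_data_pool

# the EC data pool should allow overwrites and be tagged for rbd
ceph osd pool get volumes_data allow_ec_overwrites
ceph osd pool application get volumes_data
```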
|
|
|
|
Once Openstack starts to consume disk the EC scheme is apparent.
|
|
|
|
```sh
|
|
# we have created a single 10GB VM Instance, the 10GB is thin provisioned, this Instance uses 1.2GB of space
|
|
[Universityops@test ~]$ df -Th
|
|
Filesystem Type Size Used Avail Use% Mounted on
|
|
devtmpfs devtmpfs 959M 0 959M 0% /dev
|
|
tmpfs tmpfs 987M 0 987M 0% /dev/shm
|
|
tmpfs tmpfs 987M 8.5M 978M 1% /run
|
|
tmpfs tmpfs 987M 0 987M 0% /sys/fs/cgroup
|
|
/dev/vda2 xfs 10G 1.2G 8.9G 12% /
|
|
tmpfs tmpfs 198M 0 198M 0% /run/user/1001
|
|
|
|
# ceph shows some metadata usage (for the RBD disk image) and 1.3GB of data used in volumes_data, note under an EC scheme we see 2.0GB of consumed disk VS 3.9GB under a 3way replication scheme
|
|
[root@ceph1 ~]# ceph df
|
|
--- RAW STORAGE ---
|
|
CLASS SIZE AVAIL USED RAW USED %RAW USED
|
|
hdd 310 TiB 306 TiB 4.2 TiB 4.2 TiB 1.36
|
|
TOTAL 310 TiB 306 TiB 4.2 TiB 4.2 TiB 1.36
|
|
|
|
--- POOLS ---
|
|
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
|
|
device_health_metrics 1 1 0 B 30 0 B 0 97 TiB
|
|
volumes_data 7 32 1.3 GiB 363 2.0 GiB 0 193 TiB
|
|
volumes 8 32 691 B 6 24 KiB 0 97 TiB
|
|
images 9 32 452 B 18 144 KiB 0 97 TiB
|
|
vms 10 32 0 B 1 0 B 0 97 TiB
|
|
backups 11 32 0 B 1 0 B 0 97 TiB
|
|
metrics 12 32 0 B 1 0 B 0 97 TiB
|
|
images_data 13 32 1.7 GiB 220 2.5 GiB 0 193 TiB
|
|
vms_data 14 32 0 B 0 0 B 0 193 TiB
|
|
backups_data 15 32 0 B 0 0 B 0 193 TiB
|
|
metrics_data 16 32 0 B 0 0 B 0 193 TiB
|
|
|
|
|
|
```
|
|
|
|
## Create RBD user for Openstack, assign capabilities and retrieve access token
|
|
|
|
Openstack needs credentials to access disk.
|
|
Use method 3; generally this is the direction Ceph administration is going.
|
|
|
|
```sh
|
|
# 1. Redhat CLI method, one-shot command
|
|
#
|
|
ceph auth add client.openstack mgr 'allow *' mon 'profile rbd' osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd pool=images, profile rbd pool=backups, profile rbd pool=metrics'
|
|
|
|
# 2. Manual method; you can update caps this way, however all caps must be added at once - they cannot be appended
|
|
#
|
|
ceph auth get-or-create client.openstack
|
|
ceph auth caps client.openstack mgr 'allow *' mon 'profile rbd' osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd pool=images, profile rbd pool=backups, profile rbd pool=metrics'
|
|
|
|
# Tighter mgr access, this should be fine but not tested with Openstack (official documentation does not cover tighter security model)
|
|
#
|
|
#ceph auth caps client.openstack mon 'profile rbd' osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd pool=images, profile rbd pool=backups, profile rbd pool=metrics' mgr 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd pool=images, profile rbd pool=backups, profile rbd pool=metrics'
|
|
|
|
# 3. Config Method easier to script and backup/source-control
|
|
#
|
|
# 1) generate a keyring with no caps
|
|
# 2) add caps
|
|
# 3) import user
|
|
ceph-authtool --create-keyring ceph.client.openstack.keyring --gen-key -n client.openstack
|
|
|
|
# NON EC profile
|
|
nano -cw ceph.client.openstack.keyring
|
|
|
|
[client.openstack]
|
|
key = AQCC5z5jtOmJARAAiFaC2HB4f2pBYfMKWzkkkQ==
|
|
caps mon = 'profile rbd'
|
|
caps osd = 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd pool=images, profile rbd pool=backups, profile rbd pool=metrics'
|
|
caps mgr = 'allow *'
|
|
|
|
# EC profile
|
|
|
|
[client.openstack]
|
|
key = AQCC5z5jtOmJARAAiFaC2HB4f2pBYfMKWzkkkQ==
|
|
caps mon = 'profile rbd'
|
|
caps osd = 'profile rbd pool=volumes, profile rbd pool=volumes_data, profile rbd pool=vms, profile rbd pool=vms_data, profile rbd pool=images, profile rbd pool=images_data, profile rbd pool=backups, profile rbd pool=backups_data, profile rbd pool=metrics, profile rbd pool=metrics_data'
|
|
caps mgr = 'allow *'
|
|
|
|
ceph auth import -i ceph.client.openstack.keyring
|
|
ceph auth ls
|
|
```