Troubleshooting Apache Pulsar Issues

See Apache Pulsar for general information and frequently asked questions.

Troubleshooting Apache Pulsar in Fusion

Clear Pulsar data

Use this procedure to clear the Pulsar bookkeeper data stored in PersistentVolumeClaims (PVCs) and re-establish services.

The Purge Pulsar data completely section explains how to proceed if the procedure to correct an Invalid cookie exception is not successful.

Execute the following commands to record the current number of replicas in the statefulsets (STS) for both Pulsar Bookkeeper and Broker:

BOOKKEEPER_REPLICAS=$(kubectl get sts -l "app=pulsar,component=bookkeeper" -o jsonpath="{.items[0].spec.replicas}" ; echo)
BROKER_REPLICAS=$(kubectl get sts -l "app=pulsar,component=broker" -o jsonpath="{.items[0].spec.replicas}" ; echo)
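The later scale-up steps depend on these two variables, which exist only in the current shell session, so it may be worth confirming that each one holds a number before scaling anything down. A minimal sketch, assuming a POSIX shell; `is_replica_count` is a hypothetical helper name, not part of kubectl:

```shell
# Hypothetical helper: succeeds only if the argument is a non-negative integer.
is_replica_count() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
    *) return 0 ;;
  esac
}

# In a real session, after capturing the variables:
# is_replica_count "$BOOKKEEPER_REPLICAS" || echo "BOOKKEEPER_REPLICAS is not a number" >&2
# is_replica_count "$BROKER_REPLICAS" || echo "BROKER_REPLICAS is not a number" >&2
```

If either check fails, the label selector probably matched no statefulset, and the original replica count should be noted some other way before continuing.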

Execute the following command to scale down the STS:

kubectl get sts -l "app=pulsar" --no-headers | awk '{print $1}'  | xargs kubectl scale --replicas=0 sts

Execute the following command to delete the Pulsar PVCs:

kubectl get pvc -l "app=pulsar" --no-headers | awk '{print $1}' | xargs kubectl delete pvc

When this command is run in Google Kubernetes Engine (GKE), the PersistentVolumes (PVs) associated with the PVCs are also deleted. If you are working in a non-GKE environment, run a command such as kubectl get pv to determine whether the PVs associated with the PVCs were deleted. If they were not, you must delete them before proceeding. Otherwise, the PVs and their data are reused, which causes unpredictable behavior when the system is restarted.
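One way to check for leftover PVs is sketched below. It assumes that the names of the deleted PVCs contain "pulsar"; `pulsar_pvs` is a hypothetical helper, and the custom-columns layout is only one possible choice.

```shell
# Hypothetical helper: reads "NAME CLAIM PHASE" rows on stdin and prints the
# names of the PVs whose claim name contains "pulsar".
pulsar_pvs() {
  awk '$2 ~ /pulsar/ {print $1}'
}

# In a real session:
# kubectl get pv --no-headers \
#   -o custom-columns=NAME:.metadata.name,CLAIM:.spec.claimRef.name,PHASE:.status.phase \
#   | pulsar_pvs
# Any PV that is printed still exists and can be removed with: kubectl delete pv <pv-name>
```

An empty result suggests that no PV is still bound to a Pulsar claim.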

Execute the following commands to open a shell in a Zookeeper pod:

ZOOKEEPER_NODE=$(kubectl get pods -l "app=zookeeper" --no-headers | awk '{print $1}' | head -n1)
kubectl exec -it "$ZOOKEEPER_NODE" -- /bin/bash

Execute the following commands in the Zookeeper pod to run the zkCli.sh script and delete Pulsar information:
```
cd bin
zkCli.sh
deleteall /pulsar
quit
exit
```

Execute the following command to scale the pulsar-bookkeeper replica count back to its previous value:

kubectl get sts -l "app=pulsar,component=bookkeeper" --no-headers | awk '{print $1}'  | xargs kubectl scale --replicas=$BOOKKEEPER_REPLICAS sts

When at least one pulsar-bookkeeper pod created by the previous command is running successfully, execute the following command to scale the pulsar-broker replica count back to its previous value:
```
kubectl get sts -l "app=pulsar,component=broker" --no-headers | awk '{print $1}'  | xargs kubectl scale --replicas=$BROKER_REPLICAS sts
```

When at least one pulsar-broker created from the previous command is successfully running, sign in to that pulsar-broker instance and execute the following commands:

BROKER_NODE=$(kubectl get pods -l "app=pulsar,component=broker" --no-headers | grep '1/1' | grep Running | head -n1 | awk '{print $1}')
kubectl exec -it "$BROKER_NODE" -- /bin/bash
cd bin
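The pod selection above depends on the READY column reading exactly 1/1. If the broker pods run more than one container, a variant that accepts any fully ready count may be more robust. The following is a sketch, not part of the documented procedure; `first_ready_pod` is a hypothetical helper that expects the default `kubectl get pods --no-headers` column order (NAME, READY, STATUS).

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` rows on stdin and
# prints the first pod whose READY column is n/n (n > 0) and whose STATUS is Running.
first_ready_pod() {
  awk '{
    split($2, ready, "/")
    if (ready[1] == ready[2] && ready[1] > 0 && $3 == "Running") { print $1; exit }
  }'
}

# In a real session, before opening the shell in the pod:
# BROKER_NODE=$(kubectl get pods -l "app=pulsar,component=broker" --no-headers | first_ready_pod)
```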

While still signed into the pulsar-broker pod, execute the following commands to create a:
- Tenant with the same name as the current Kubernetes namespace
- Namespace inside the new tenant named _logs
- Topic inside the new namespace named system_logs
  
  In the commands, replace $(NAMESPACE) with the name of your current Kubernetes namespace.
  
  ./pulsar-admin tenants create $(NAMESPACE)
  ./pulsar-admin namespaces create $(NAMESPACE)/_logs
  ./pulsar-admin topics create persistent://$(NAMESPACE)/_logs/system_logs
  exit
  
  For example, if the namespace is dev, the commands are:
  
  ./pulsar-admin tenants create dev
  ./pulsar-admin namespaces create dev/_logs
  ./pulsar-admin topics create persistent://dev/_logs/system_logs
  exit
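To avoid substituting the namespace by hand, the three pulsar-admin commands can be generated from a single argument. This is only a sketch; `pulsar_bootstrap_commands` is a hypothetical helper that prints the commands for review and does not run them.

```shell
# Hypothetical helper: prints the tenant, namespace, and topic commands
# for the Kubernetes namespace given as the first argument.
pulsar_bootstrap_commands() {
  ns="$1"
  printf './pulsar-admin tenants create %s\n' "$ns"
  printf './pulsar-admin namespaces create %s/_logs\n' "$ns"
  printf './pulsar-admin topics create persistent://%s/_logs/system_logs\n' "$ns"
}

# Example: pulsar_bootstrap_commands dev
```

The printed lines can then be pasted into the shell that is open in the pulsar-broker pod.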

Execute the following command to restart all applicable deployments:

kubectl get deployments --no-headers | awk '{print $1}' | xargs kubectl rollout restart deployment

Invalid cookie exception

The org.apache.bookkeeper.bookie.BookieException$InvalidCookieException error is typically generated when data on the persistent volumes mounted on bookkeeper nodes is corrupted or lost.

Verify the managedLedgerDefaultWriteQuorum field contains an adequate number of bookkeepers for a write quorum. The default value is 2. If necessary, increase the value.

Execute the following command to scale up the pulsar-bookkeeper STS and add more bookkeeper nodes:

kubectl scale -n example sts/example-pulsar-bookkeeper --replicas=4

If the STS cannot scale up, one of the pods in the sequence may be failing to start. Temporarily disable the readiness probe and increase the value of initialDelaySeconds on the liveness probe so that the failing pod does not block creation of the next pod.

When a sufficient number of healthy bookkeeper nodes exist for a quorum, execute the following command to obtain the failing bookkeeper cookie IDs from Zookeeper:

kubectl exec -n example example-zookeeper-0 -- bin/zkCli.sh ls /pulsar/ledgers/cookies

The command returns a list of all bookkeeper IDs registered with the cluster. For example:

$ kubectl exec -n example example-zookeeper-0 -- bin/zkCli.sh ls /pulsar/ledgers/cookies

[example-pulsar-bookkeeper-0.example-pulsar-bookkeeper.example.svc.cluster.local:3181, example-pulsar-bookkeeper-1.example-pulsar-bookkeeper.example.svc.cluster.local:3181, example-pulsar-bookkeeper-2.example-pulsar-bookkeeper.example.svc.cluster.local:3181]
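The bracketed list is easier to compare against the failing pods when it is split into one ID per line. A small sketch, assuming the list appears on a single line of the zkCli.sh output as in the example above; `list_bookie_ids` is a hypothetical helper:

```shell
# Hypothetical helper: reads zkCli.sh output on stdin and prints one bookie ID per line.
list_bookie_ids() {
  grep '^\[.*]$' |   # keep only lines that look like "[a, b, c]"
    tail -n1 |       # use the last such line
    tr -d '[]' |     # drop the surrounding brackets
    tr ',' '\n' |    # one entry per line
    sed 's/^ *//'    # trim the space that followed each comma
}

# In a real session:
# kubectl exec -n example example-zookeeper-0 -- bin/zkCli.sh ls /pulsar/ledgers/cookies | list_bookie_ids
```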

Execute the following command to decommission each of the corrupted bookkeeper nodes:
```
decommissionbookie
```
For example, if the example-pulsar-bookkeeper-1 node is corrupt, execute the following command:
```
kubectl exec -n example example-pulsar-bookkeeper-0 -- /pulsar/bin/bookkeeper shell decommissionbookie -bookieid example-pulsar-bookkeeper-1.example-pulsar-bookkeeper.example.svc.cluster.local:3181
```
The command must finish before executing the command on another corrupt node.
Restore the readiness probe and liveness probe to their original state if they were removed or modified earlier in this procedure.
Execute the following command to delete each of the failing pods that were decommissioned. For example:
```
kubectl delete pod -n example example-pulsar-bookkeeper-1
```
You must wait until each of the deleted pods is replaced and the new pods are in a "healthy" state.
If necessary, set the value of the pulsar-bookkeeper STS to its original size.

If any additional nodes are removed, follow the steps of this procedure to decommission and delete them.
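Because each decommission must finish before the next one starts, a plain sequential loop fits the requirement. The following is a sketch under stated assumptions: `decommission_bookies` is a hypothetical wrapper, and the pod name and /pulsar/bin/bookkeeper path are taken from the example in this procedure and may differ in your deployment.

```shell
# Hypothetical wrapper: decommissions each bookie ID in turn and stops at the first failure.
# Usage: decommission_bookies <namespace> <healthy-bookkeeper-pod> <bookie-id>...
# Set KUBECTL=echo to preview the commands without running them.
decommission_bookies() {
  ns="$1"; healthy_pod="$2"; shift 2
  for bookie_id in "$@"; do
    # kubectl exec returns only after the remote command exits, so the
    # next decommission cannot start before this one has finished.
    ${KUBECTL:-kubectl} exec -n "$ns" "$healthy_pod" -- \
      /pulsar/bin/bookkeeper shell decommissionbookie -bookieid "$bookie_id" || return 1
  done
}
```

Running it first with KUBECTL=echo prints the exact kubectl commands, which can be reviewed before anything is decommissioned.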

Purge Pulsar data completely

If the procedure defined in Invalid cookie exception cannot be completed, and there are no concerns about losing existing Pulsar metadata, use this procedure to purge Pulsar Zookeeper data.

Downtime is required to purge Pulsar metadata completely.

Execute the following commands to scale down the pulsar-broker and pulsar-bookkeeper STS to zero replicas:
```
kubectl scale -n example sts/example-pulsar-bookkeeper --replicas=0
kubectl scale -n example sts/example-pulsar-broker --replicas=0
```
You must wait until all of the pods are deleted before proceeding.
Execute the following command to list all of the PVCs for pulsar-bookkeeper pods:
```
kubectl get pvc -n example -o name | grep pulsar-bookkeeper
```
Verify the correct bookkeeper PVCs are listed for your cluster/namespace.
Execute the following command to delete each of the PVCs owned by previously-existing pulsar-bookkeeper pods:
```
kubectl get pvc -n example -o name | grep pulsar-bookkeeper | xargs -I {} kubectl delete -n example {}
```
This command runs kubectl delete once for each listed PVC, deleting them one by one.
Execute the following command to delete all nodes under the /pulsar path in Zookeeper:
```
kubectl exec -n example example-zookeeper-0 -- bin/zkCli.sh deleteall /pulsar
```
Execute the following commands to scale the pulsar-broker and pulsar-bookkeeper STS to the appropriate number of replicas:
```
kubectl scale -n example sts/example-pulsar-bookkeeper --replicas=3
kubectl scale -n example sts/example-pulsar-broker --replicas=2
```
You must wait until all of the pods are scheduled and ready before proceeding.
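Instead of polling the pod list by hand, kubectl rollout status can block until a statefulset reports all of its replicas ready. A sketch, assuming the statefulsets use the default RollingUpdate update strategy; `wait_for_sts` is a hypothetical wrapper and the 10-minute timeout is an arbitrary choice.

```shell
# Hypothetical wrapper: waits for each named statefulset to become fully ready.
# Usage: wait_for_sts <namespace> <statefulset>...
# Set KUBECTL=echo to preview the commands without running them.
wait_for_sts() {
  ns="$1"; shift
  for sts in "$@"; do
    ${KUBECTL:-kubectl} rollout status -n "$ns" "sts/$sts" --timeout=10m || return 1
  done
}

# In a real session:
# wait_for_sts example example-pulsar-bookkeeper example-pulsar-broker
```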

Restart all remaining Fusion services pods except zookeeper, solr, pulsar-broker, and pulsar-bookkeeper. Based on the Fusion deployment customization for your system, there may be other pods that should not be restarted.

The best practice is to restart services individually, one at a time, and to take the specifics of your Fusion deployment into account before deleting pods. If you determine you want to delete all remaining pods in a namespace, execute the following command:

kubectl get pods -n example -o name | grep -v "zookeeper\|solr\|pulsar" | xargs -I {} kubectl delete -n example {}
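Before deleting anything, it can help to preview which pods the exclusion filter leaves in scope. A minimal sketch; `restartable_pods` is a hypothetical helper that applies the same three exclusions by name, so any other pods that must not be restarted in your deployment need to be added to the pattern.

```shell
# Hypothetical helper: reads pod names on stdin and drops any whose name
# contains zookeeper, solr, or pulsar.
restartable_pods() {
  grep -Ev 'zookeeper|solr|pulsar'
}

# In a real session, preview the list first:
# kubectl get pods -n example -o name | restartable_pods
```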

OutOfDirectMemoryError

The bookkeeper pod can exit due to an OutOfDirectMemoryError, which is typically generated in clusters with high Pulsar throughput and/or slow disk input/output (I/O). For example:

[bookie-io-1-1] ERROR org.apache.bookkeeper.proto.BookieServer - Unable to allocate memory, exiting bookie
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 2919235615, max: 2936012800)

Bookkeeper uses direct memory to cache data before it is written to disk.

To allocate more direct memory based on the expected highest network throughput relative to the disk I/O:

Access the bookkeeper.configData.PULSAR_MEM helm parameter.
Add -XX:MaxDirectMemorySize with the appropriate value.

For example, -XX:MaxDirectMemorySize=4G allocates 4 gigabytes of direct memory.
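When choosing a new value, it helps to read the current limit out of the error itself. The sketch below extracts the numbers from a log line in the format shown earlier in this section; `direct_memory_summary` is a hypothetical helper, and it only reports the figures without recommending a size.

```shell
# Hypothetical helper: reads log lines on stdin and, for each OutOfDirectMemoryError
# line, prints the requested bytes, the free bytes (max - used), and the limit in MiB.
direct_memory_summary() {
  sed -n 's/.*failed to allocate \([0-9]*\) byte(s) of direct memory (used: \([0-9]*\), max: \([0-9]*\)).*/\1 \2 \3/p' |
    while read -r requested used max; do
      echo "requested=$requested free=$((max - used)) max_mib=$((max / 1048576))"
    done
}
```

For the example log line above, this prints requested=16777216 free=16777185 max_mib=2800: the bookie asked for 16 MiB when slightly less than that remained under a 2800 MiB limit.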