Troubleshooting Apache Pulsar Issues
| See Apache Pulsar for general information and frequently asked questions. |
Troubleshooting Apache Pulsar in Fusion
Clear Pulsar data
Use this procedure to clear the PersistentVolumeClaims (PVCs) that hold Pulsar bookkeeper data and re-establish services.
| The Purge Pulsar data completely section explains how to proceed if the procedure to correct an Invalid cookie exception is not successful. |
-
Execute the following commands to record the current number of replicas of the StatefulSets (STS) for both Pulsar Bookkeeper and Broker. Run the later scale-up steps from the same shell session so that these variables are still set:
BOOKKEEPER_REPLICAS=$(kubectl get sts -l "app=pulsar,component=bookkeeper" -o jsonpath="{.items[0].spec.replicas}" ; echo)
BROKER_REPLICAS=$(kubectl get sts -l "app=pulsar,component=broker" -o jsonpath="{.items[0].spec.replicas}" ; echo)
-
Execute the following command to scale down the STS:
kubectl get sts -l "app=pulsar" --no-headers | awk '{print $1}' | xargs kubectl scale --replicas=0 sts
-
Execute the following command to delete the Pulsar PVCs:
kubectl get pvc -l "app=pulsar" --no-headers | awk '{print $1}' | xargs kubectl delete pvc
When this command is run in Google Kubernetes Engine (GKE), the PersistentVolumes (PVs) associated with the PVCs are also deleted. If you are working in a non-GKE environment, run kubectl get pv to determine whether the PVs associated with the PVCs were deleted. If they were not, you must delete them before proceeding. Otherwise, the PVs and their data are reused, which causes unpredictable behavior when the system is restarted.
-
Sign in to Zookeeper by executing the following commands:
ZOOKEEPER_NODE=$(kubectl get pods -l "app=zookeeper" --no-headers | awk '{print $1}' | head -n1)
kubectl exec -it "$ZOOKEEPER_NODE" -- /bin/bash
-
Execute the following commands in Zookeeper to run the zkCli.sh script and delete Pulsar information:
cd bin
./zkCli.sh
deleteall /pulsar
quit
exit
-
Execute the following command to scale the pulsar-bookkeeper replica count to its previous number:
kubectl get sts -l "app=pulsar,component=bookkeeper" --no-headers | awk '{print $1}' | xargs kubectl scale --replicas=$BOOKKEEPER_REPLICAS sts
-
When at least one pulsar-bookkeeper pod created by the previous command is running successfully, execute the following command to scale the pulsar-broker replica count to its previous number:
kubectl get sts -l "app=pulsar,component=broker" --no-headers | awk '{print $1}' | xargs kubectl scale --replicas=$BROKER_REPLICAS sts
-
When at least one pulsar-broker pod created by the previous command is running successfully, sign in to that pulsar-broker instance by executing the following commands:
BROKER_NODE=$(kubectl get pods -l "app=pulsar,component=broker" --no-headers | grep '1/1' | grep Running | head -n1 | awk '{print $1}')
kubectl exec -it "$BROKER_NODE" -- /bin/bash
cd bin
-
While still signed in to the pulsar-broker pod, execute the following commands to create a:
-
Tenant with the same name as the current Kubernetes namespace
-
Namespace inside the new tenant named _logs
-
Topic inside the new namespace named system_logs
In the commands, replace $(NAMESPACE) with the name of your current Kubernetes namespace. It is a placeholder and is not meant to be run as written:
./pulsar-admin tenants create $(NAMESPACE)
./pulsar-admin namespaces create $(NAMESPACE)/_logs
./pulsar-admin topics create persistent://$(NAMESPACE)/_logs/system_logs
exit
For example, if the namespace is dev, the commands are:
./pulsar-admin tenants create dev
./pulsar-admin namespaces create dev/_logs
./pulsar-admin topics create persistent://dev/_logs/system_logs
exit
-
Execute the following command to restart all applicable deployments:
kubectl get deployments --no-headers | awk '{print $1}' | xargs kubectl rollout restart deployment
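The two "when at least one pod is successfully running" checks in the steps above can be scripted instead of watched by hand. The following is a minimal sketch, not part of the official procedure: the function names count_ready and wait_for_ready, the 5-second polling interval, and the 300-second default timeout are assumptions chosen for illustration. It applies the same test as the grep '1/1' | grep Running filter used in the procedure.

```shell
#!/usr/bin/env bash

# Count fully ready pods, reading `kubectl get pods --no-headers` output
# on stdin (columns: NAME READY STATUS RESTARTS AGE).
count_ready() {
  awk '{ split($2, r, "/")
         if (r[1] == r[2] && r[1] > 0 && $3 == "Running") n++ }
       END { print n + 0 }'
}

# Poll until at least one pod matching the label selector is ready.
# Arguments: label selector, optional timeout in seconds (default 300).
wait_for_ready() {
  local selector=$1 timeout=${2:-300} waited=0
  while [ "$(kubectl get pods -l "$selector" --no-headers 2>/dev/null | count_ready)" -lt 1 ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for pods matching $selector" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
}

# Example (not executed here):
#   wait_for_ready "app=pulsar,component=bookkeeper" && echo "bookkeeper is up"
```

Calling wait_for_ready with the bookkeeper selector before scaling the broker, and with the broker selector before signing in to a broker pod, follows the order of the procedure.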
Invalid cookie exception
The org.apache.bookkeeper.bookie.BookieException$InvalidCookieException error is typically generated when corrupted or lost data exists on the persistent volumes mounted on bookkeeper nodes.
-
Verify the managedLedgerDefaultWriteQuorum field contains an adequate number of bookkeepers for a write quorum. The default value is 2. If necessary, increase the value.
-
Execute the following command to scale up the pulsar-bookkeeper STS and add more bookkeeper nodes:
kubectl scale -n example sts/example-pulsar-bookkeeper --replicas=4
If the STS cannot scale up, one of the pods in sequence may be failing to start. Temporarily disable the readiness probe and increase the value of initialDelaySeconds on the liveness probe. This allows enough time that the failing probe does not prevent creation of the next pod.
-
When a sufficient number of healthy bookkeeper nodes exist for a quorum, execute the following command to obtain the failing bookkeeper cookie IDs from Zookeeper:
kubectl exec -n example example-zookeeper-0 -- bin/zkCli.sh ls /pulsar/ledgers/cookies
The command returns a list of all bookkeeper IDs registered with the cluster. For example:
$ kubectl exec -n example example-zookeeper-0 -- bin/zkCli.sh ls /pulsar/ledgers/cookies
[example-pulsar-bookkeeper-0.example-pulsar-bookkeeper.example.svc.cluster.local:3181, example-pulsar-bookkeeper-1.example-pulsar-bookkeeper.example.svc.cluster.local:3181, example-pulsar-bookkeeper-2.example-pulsar-bookkeeper.example.svc.cluster.local:3181]
-
Execute the following command to decommission each of the corrupted bookkeeper nodes:
decommissionbookie
For example, if the example-pulsar-bookkeeper-1 node is corrupt, execute the following command:
kubectl exec -n example example-pulsar-bookkeeper-0 -- /pulsar/bin/bookkeeper shell decommissionbookie -bookieid example-pulsar-bookkeeper-1.example-pulsar-bookkeeper.example.svc.cluster.local:3181
The command must finish before you execute it for another corrupt node.
-
Restore the readiness probe and liveness probe to their original state if they were removed or modified earlier in this procedure.
-
Execute the following command to delete each of the failing pods that were decommissioned. For example:
kubectl delete pod -n example example-pulsar-bookkeeper-1
You must wait until each of the deleted pods is replaced and the new pods are in a "healthy" state.
-
If necessary, scale the pulsar-bookkeeper STS back to its original size.
| If any additional nodes are removed, follow the steps of this procedure to decommission and delete them. |
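When several bookkeepers are corrupt, it can help to turn the cookie listing into one ID per line and to print each decommission command for review before running it. The sketch below is illustrative only: the function names list_bookie_ids and print_decommission_cmd are hypothetical, and the parser assumes zkCli.sh prints the ID list on a single line of the form [id1, id2, ...], as in the example output in this procedure.

```shell
#!/usr/bin/env bash

# Read zkCli.sh output on stdin and print one bookie ID per line.
# Assumption: the list appears on its own line as "[id1, id2, ...]".
list_bookie_ids() {
  grep '^\[.*\]$' | tail -n 1 |
    sed 's/^\[//; s/\]$//' |
    tr ',' '\n' |
    sed 's/^ *//; /^$/d'
}

# Print (do not run) the decommission command for one corrupt bookie.
# Arguments: namespace, healthy pod to run the shell in, corrupt bookie ID.
print_decommission_cmd() {
  printf 'kubectl exec -n %s %s -- /pulsar/bin/bookkeeper shell decommissionbookie -bookieid %s\n' \
    "$1" "$2" "$3"
}

# Example (not executed here):
#   kubectl exec -n example example-zookeeper-0 -- bin/zkCli.sh ls /pulsar/ledgers/cookies | list_bookie_ids
```

Printing the commands rather than executing them keeps the requirement that each decommission finishes before the next one starts under the operator's control.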
Purge Pulsar data completely
If the procedure defined in Invalid cookie exception cannot be completed, and there are no concerns about losing existing Pulsar metadata, use this procedure to purge Pulsar Zookeeper data.
| Downtime is required to purge Pulsar metadata completely. |
-
Execute the following commands to scale down the pulsar-broker and pulsar-bookkeeper STS to zero replicas:
kubectl scale -n example sts/example-pulsar-bookkeeper --replicas=0
kubectl scale -n example sts/example-pulsar-broker --replicas=0
You must wait until all of the pods are deleted before proceeding.
-
Execute the following command to list all of the PVCs for pulsar-bookkeeper pods:
kubectl get pvc -n example -o name | grep pulsar-bookkeeper
-
Verify the correct bookkeeper PVCs are listed for your cluster/namespace.
-
Execute the following command to delete each of the PVCs owned by previously existing pulsar-bookkeeper pods:
kubectl get pvc -n example -o name | grep pulsar-bookkeeper | xargs -I {} kubectl delete -n example {}
The xargs -I {} option runs kubectl delete once for each PVC, so the PVCs are deleted one by one.
-
Execute the following command to delete all nodes under the /pulsar path in Zookeeper:
kubectl exec -n example example-zookeeper-0 -- bin/zkCli.sh deleteall /pulsar
-
Execute the following commands to scale the pulsar-broker and pulsar-bookkeeper STS to the appropriate number of replicas:
kubectl scale -n example sts/example-pulsar-bookkeeper --replicas=3
kubectl scale -n example sts/example-pulsar-broker --replicas=2
You must wait until all of the pods are scheduled and ready before proceeding.
-
Restart all remaining Fusion services pods except zookeeper, solr, pulsar-broker, and pulsar-bookkeeper. Based on the Fusion deployment customization for your system, there may be other pods that should not be restarted.
The best practice recommendation is to restart services individually, one at a time, and to take Fusion deployment specifics into account before deleting pods. If you determine you want to delete all remaining pods in a namespace, execute the following command:
kubectl get pods -n example -o name | grep -v "zookeeper\|solr\|pulsar" | xargs -I {} kubectl delete -n example {}
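The one-at-a-time recommendation for the final restart step can be approximated with a small script. This is a sketch under stated assumptions rather than a supported tool: filter_restartable and restart_one_by_one are hypothetical names, the 30-second default pause is arbitrary, and any extra patterns passed to filter_restartable stand for deployment-specific pods that you have decided must not be restarted.

```shell
#!/usr/bin/env bash

# Read `kubectl get pods -o name` output on stdin and drop pods that must
# not be restarted. Extra name patterns to protect may be passed as arguments.
filter_restartable() {
  local pattern='zookeeper|solr|pulsar' extra
  for extra in "$@"; do
    pattern="$pattern|$extra"
  done
  grep -Ev "$pattern"
}

# Delete the pods named on stdin one at a time, pausing between deletions.
# Arguments: namespace, optional pause in seconds (default 30).
restart_one_by_one() {
  local namespace=$1 pause=${2:-30} pod
  while read -r pod; do
    echo "deleting $pod"
    kubectl delete -n "$namespace" "$pod" < /dev/null || return 1
    sleep "$pause"
  done
}

# Example (not executed here):
#   kubectl get pods -n example -o name | filter_restartable | restart_one_by_one example
```

Running the pipeline without the final restart_one_by_one stage first shows which pods would be deleted, which is a cheap way to check the filter against your deployment.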
OutOfDirectMemoryError
The bookkeeper pod can exit due to an OutOfDirectMemoryError, which is typically generated in clusters with high Pulsar throughput and/or slow disk input/output (I/O). For example:
[bookie-io-1-1] ERROR org.apache.bookkeeper.proto.BookieServer - Unable to allocate memory, exiting bookie
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 2919235615, max: 2936012800)
Bookkeeper uses direct memory to cache data before it is written to disk.
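The byte counts in the error message are easier to reason about in MiB. As a rough illustration, the sketch below extracts the three numbers from the first matching log line and converts them with integer division; the function name summarize_direct_memory_error is hypothetical. For the example log above it reports a 16 MiB request against 2784 MiB used and a 2800 MiB limit, which shows the bookie was at its ceiling when the allocation failed.

```shell
#!/usr/bin/env bash

# Read a bookie log on stdin and summarise the first OutOfDirectMemoryError
# line: failed request, used amount, and limit, each in whole MiB.
summarize_direct_memory_error() {
  sed -n 's/.*failed to allocate \([0-9][0-9]*\) byte(s) of direct memory (used: \([0-9][0-9]*\), max: \([0-9][0-9]*\)).*/\1 \2 \3/p' |
    head -n 1 |
    while read -r request used max; do
      echo "request_mib=$((request / 1048576)) used_mib=$((used / 1048576)) max_mib=$((max / 1048576))"
    done
}

# Example (not executed here; the pod name is taken from the examples in this page):
#   kubectl logs -n example example-pulsar-bookkeeper-0 | summarize_direct_memory_error
```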
To allocate more direct memory based on the expected highest network throughput relative to the disk I/O:
-
Access the bookkeeper.configData.PULSAR_MEM Helm parameter.
-
Add -XX:MaxDirectMemorySize with the appropriate value.
For example, -XX:MaxDirectMemorySize=4G sets the direct memory limit to 4 gigabytes.
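If PULSAR_MEM already contains JVM options, the new flag should replace any existing -XX:MaxDirectMemorySize entry rather than sit next to it. The helper below is a minimal sketch of that string edit; the function name set_max_direct_memory and the sample options in the comment (-Xms2g -Xmx2g) are hypothetical and not taken from any Fusion default. How the resulting string is applied depends on how your deployment manages Helm values.

```shell
#!/usr/bin/env bash

# Return a PULSAR_MEM string with the direct memory limit set to the given size.
# Arguments: current JVM options (may be empty), new size such as 4G.
set_max_direct_memory() {
  local current=$1 size=$2 opt result=""
  for opt in $current; do                 # intentional word splitting on whitespace
    case $opt in
      -XX:MaxDirectMemorySize=*) ;;       # drop the previous limit
      *) result="$result $opt" ;;         # keep every other option unchanged
    esac
  done
  result="$result -XX:MaxDirectMemorySize=$size"
  echo "${result# }"                      # strip the leading space
}

# Example (not executed here):
#   set_max_direct_memory "-Xms2g -Xmx2g -XX:MaxDirectMemorySize=2g" 4G
```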