Troubleshooting Apache Pulsar Issues
See Apache Pulsar for general information and frequently asked questions.
Troubleshooting Apache Pulsar in Fusion
Clear Pulsar data
Use this procedure to clear the PersistentVolumeClaims (PVCs) that hold Pulsar bookkeeper data and re-establish services.
If the procedure to correct an Invalid cookie exception is not successful, see the Purge Pulsar data completely section for how to proceed.
-
Execute the following command to obtain the number of existing replicas of statefulsets (STS) for both Pulsar Bookkeeper and Broker:
BOOKKEEPER_REPLICAS=$(kubectl get sts -l "app=pulsar,component=bookkeeper" -o jsonpath="{.items[0].spec.replicas}")
BROKER_REPLICAS=$(kubectl get sts -l "app=pulsar,component=broker" -o jsonpath="{.items[0].spec.replicas}")
-
Execute the following command to scale down the STS:
kubectl get sts -l "app=pulsar" --no-headers | awk '{print $1}' | xargs kubectl scale --replicas=0 sts
-
Execute the following command to delete the Pulsar PVCs:
kubectl get pvc -l "app=pulsar" --no-headers | awk '{print $1}' | xargs kubectl delete pvc
When this command is run in Google Kubernetes Engine (GKE), the PersistentVolumes (PVs) associated with the PVCs are also deleted. In a non-GKE environment, check whether the PVs associated with the PVCs were deleted, for example with kubectl get pv. If any remain, you must delete them before proceeding. Otherwise, the PVs and their data are reused, which causes unpredictable behavior when the system is restarted.
-
Execute the following commands to open a shell in a Zookeeper pod:
ZOOKEEPER_NODE=$(kubectl get pods -l "app=zookeeper" --no-headers | awk '{print $1}' | head -n1)
kubectl exec -it $ZOOKEEPER_NODE -- /bin/bash
-
Execute the following commands in the Zookeeper pod to run the zkCli.sh script and delete Pulsar information:
cd bin
zkCli.sh
deleteall /pulsar
quit
exit
-
Execute the following command to scale the pulsar-bookkeeper replica count to its previous number:
kubectl get sts -l "app=pulsar,component=bookkeeper" --no-headers | awk '{print $1}' | xargs kubectl scale --replicas=$BOOKKEEPER_REPLICAS sts
-
When at least one pulsar-bookkeeper pod created by the previous command is running successfully, execute the following command to scale the pulsar-broker replica count to its previous number:
kubectl get sts -l "app=pulsar,component=broker" --no-headers | awk '{print $1}' | xargs kubectl scale --replicas=$BROKER_REPLICAS sts
-
When at least one pulsar-broker pod created by the previous command is running successfully, sign in to that pulsar-broker instance by executing the following commands:
BROKER_NODE=$(kubectl get pods -l "app=pulsar,component=broker" --no-headers | grep '1/1' | grep Running | head -n1 | awk '{print $1}')
kubectl exec -it $BROKER_NODE -- /bin/bash
cd bin
-
While still signed in to the pulsar-broker pod, execute the following commands to create a:
-
Tenant with the same name as the current Kubernetes namespace
-
Namespace inside the new tenant named _logs
-
Topic inside the new namespace named system_logs
In the commands, replace $(NAMESPACE) with the name of your current Kubernetes namespace. It is a placeholder: if typed literally, the shell treats $(NAMESPACE) as a command substitution.
./pulsar-admin tenants create $(NAMESPACE)
./pulsar-admin namespaces create $(NAMESPACE)/_logs
./pulsar-admin topics create persistent://$(NAMESPACE)/_logs/system_logs
exit
For example, if the namespace is dev, the commands are:
./pulsar-admin tenants create dev
./pulsar-admin namespaces create dev/_logs
./pulsar-admin topics create persistent://dev/_logs/system_logs
exit
-
Execute the following command to restart all applicable deployments:
kubectl get deployments --no-headers | awk '{print $1}' | xargs kubectl rollout restart deployment
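The two "at least one pod is running" checks in this procedure can be scripted instead of watched by hand. The following is a minimal sketch, assuming the same app=pulsar,component=... labels used in the commands above. The function names first_ready_pod and wait_for_ready_pod, the attempt count, and the polling interval are illustrative and are not part of Fusion or Pulsar.

```shell
# Print the name of the first pod matching a label selector that is
# Running with its single container ready (READY column "1/1").
# This mirrors the grep '1/1' | grep Running filter used in the procedure.
# $1 = label selector
first_ready_pod() {
  kubectl get pods -l "$1" --no-headers \
    | awk '$2 == "1/1" && $3 == "Running" {print $1; exit}'
}

# Poll until a ready pod exists, then print its name.
# $1 = label selector
# $2 = maximum attempts (assumed default: 60)
# $3 = seconds between attempts (assumed default: 5)
wait_for_ready_pod() {
  selector=$1
  attempts=${2:-60}
  interval=${3:-5}
  while [ "$attempts" -gt 0 ]; do
    pod=$(first_ready_pod "$selector")
    if [ -n "$pod" ]; then
      echo "$pod"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  echo "no ready pod for selector: $selector" >&2
  return 1
}
```

For example, BROKER_NODE=$(wait_for_ready_pod "app=pulsar,component=broker") blocks until a broker pod is ready and then stores its name. Returning a non-zero status on timeout lets a calling script stop instead of continuing against a cluster that never recovered.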
Invalid cookie exception
The org.apache.bookkeeper.bookie.BookieException$InvalidCookieException
error is typically generated when corrupted or lost data exists on the persistent volumes mounted on bookkeeper nodes.
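Before starting the procedure, it can help to confirm which bookkeeper pods are actually reporting the exception. The following is a minimal sketch, assuming the bookkeeper pods carry the app=pulsar,component=bookkeeper labels used in the Clear Pulsar data commands. The function name bookies_with_invalid_cookie and the default of 500 log lines are illustrative choices.

```shell
# List bookkeeper pods whose recent logs mention InvalidCookieException.
# $1 = namespace
# $2 = number of log lines to inspect per pod (assumed default: 500)
bookies_with_invalid_cookie() {
  ns=$1
  tail_lines=${2:-500}
  kubectl get pods -n "$ns" -l "app=pulsar,component=bookkeeper" -o name \
    | while read -r pod; do
        # Pods whose logs cannot be read are skipped silently.
        if kubectl logs -n "$ns" "$pod" --tail="$tail_lines" 2>/dev/null \
            | grep -q 'InvalidCookieException'; then
          echo "$pod"
        fi
      done
}
```

For example, bookies_with_invalid_cookie example prints one pod/NAME line per affected pod. If a pod is restarting repeatedly, the exception may only appear in the logs of its previous container, which kubectl logs exposes through the --previous flag.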
-
Verify that the managedLedgerDefaultWriteQuorum field specifies an adequate number of bookkeepers for a write quorum. The default value is 2. If necessary, increase the value.
-
Execute the following command to scale up the pulsar-bookkeeper STS and add more bookkeeper nodes:
kubectl scale -n example sts/example-pulsar-bookkeeper --replicas=4
If the STS cannot scale up, one of the pods in the sequence may be failing to start. Temporarily disable the readiness probe and increase the value of initialDelaySeconds on the liveness probe. This allows enough time that the failing probe does not prevent creation of the next pod.
-
When a sufficient number of healthy bookkeeper nodes exist for a quorum, execute the following command to obtain the failing bookkeeper cookie IDs from Zookeeper:
kubectl exec -n example example-zookeeper-0 -- bin/zkCli.sh ls /pulsar/ledgers/cookies
The command returns a list of all bookkeeper IDs registered with the cluster. For example:
$ kubectl exec -n example example-zookeeper-0 -- bin/zkCli.sh ls /pulsar/ledgers/cookies
[example-pulsar-bookkeeper-0.example-pulsar-bookkeeper.example.svc.cluster.local:3181, example-pulsar-bookkeeper-1.example-pulsar-bookkeeper.example.svc.cluster.local:3181, example-pulsar-bookkeeper-2.example-pulsar-bookkeeper.example.svc.cluster.local:3181]
-
Execute the decommissionbookie command to decommission each of the corrupted bookkeeper nodes. For example, if the example-pulsar-bookkeeper-1 node is corrupt, execute the following command:
kubectl exec -n example example-pulsar-bookkeeper-0 -- /pulsar/bin/bookkeeper shell decommissionbookie -bookieid example-pulsar-bookkeeper-1.example-pulsar-bookkeeper.example.svc.cluster.local:3181
The command must finish before you execute it for another corrupt node.
-
Restore the readiness probe and liveness probe to their original state if they were removed or modified earlier in this procedure.
-
Execute the following command to delete each of the failing pods that were decommissioned. For example:
kubectl delete pod -n example example-pulsar-bookkeeper-1
You must wait until each deleted pod is replaced and the new pod is in a "healthy" state.
-
If necessary, set the replica count of the pulsar-bookkeeper STS back to its original size.
If any additional nodes are removed, follow the steps of this procedure to decommission and delete them.
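The cookie listing and the one-at-a-time decommission rule in this procedure can be sketched as two small helpers. This is a sketch under assumptions: parse_cookie_list and decommission_bookies are hypothetical names, and the parser assumes the bookie IDs appear on a single bracketed line, as in the example output above.

```shell
# Turn the bracketed list printed by "zkCli.sh ls /pulsar/ledgers/cookies"
# into one bookie ID per line. Reads the zkCli output on stdin and uses the
# last line that starts with "[", since other lines may be printed first.
parse_cookie_list() {
  grep '^\[' | tail -n 1 | tr -d '[]' | tr ',' '\n' \
    | sed 's/^ *//; s/ *$//' | grep -v '^$'
}

# Decommission the given bookies strictly one after another, stopping at
# the first failure so a later bookie is never started early.
# $1 = namespace
# $2 = healthy bookkeeper pod in which to run the bookkeeper shell
# remaining arguments = bookie IDs to decommission
decommission_bookies() {
  ns=$1
  healthy_pod=$2
  shift 2
  for bookie_id in "$@"; do
    echo "decommissioning $bookie_id" >&2
    kubectl exec -n "$ns" "$healthy_pod" -- /pulsar/bin/bookkeeper shell \
      decommissionbookie -bookieid "$bookie_id" || return 1
  done
}
```

For example, piping the zkCli.sh listing into parse_cookie_list gives a plain list from which the corrupt bookie IDs can be picked and passed to decommission_bookies example example-pulsar-bookkeeper-0 followed by those IDs.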
Purge Pulsar data completely
If the procedure defined in Invalid cookie exception cannot be completed, and there are no concerns about losing existing Pulsar metadata, use this procedure to purge Pulsar Zookeeper data.
Downtime is required to purge Pulsar metadata completely.
-
Execute the following commands to scale down the pulsar-broker and pulsar-bookkeeper STS to zero replicas:
kubectl scale -n example sts/example-pulsar-bookkeeper --replicas=0
kubectl scale -n example sts/example-pulsar-broker --replicas=0
You must wait until all of the pods are deleted before proceeding.
-
Execute the following command to list all of the PVCs for pulsar-bookkeeper pods:
kubectl get pvc -n example -o name | grep pulsar-bookkeeper
-
Verify the correct bookkeeper PVCs are listed for your cluster/namespace.
-
Execute the following command to delete each of the PVCs owned by previously existing pulsar-bookkeeper pods:
kubectl get pvc -n example -o name | grep pulsar-bookkeeper | xargs -I {} kubectl delete -n example {}
Because of xargs -I {}, kubectl delete runs once for each PVC, so the PVCs are deleted one by one.
-
Execute the following command to delete all nodes under the /pulsar path in Zookeeper:
kubectl exec -n example example-zookeeper-0 -- bin/zkCli.sh deleteall /pulsar
-
Execute the following commands to scale the pulsar-broker and pulsar-bookkeeper STS to the appropriate number of replicas:
kubectl scale -n example sts/example-pulsar-bookkeeper --replicas=3
kubectl scale -n example sts/example-pulsar-broker --replicas=2
You must wait until all of the pods are scheduled and ready before proceeding.
-
Restart all remaining Fusion services pods except zookeeper, solr, pulsar-broker, and pulsar-bookkeeper. Based on the Fusion deployment customization for your system, there may be other pods that should not be restarted.
The best practice is to restart services individually, one at a time, and to take Fusion deployment specifics into account before deleting pods. If you determine that you want to delete all other pods in a namespace, execute the following command:
kubectl get pods -n example -o name | grep -v "zookeeper\|solr\|pulsar" | xargs -I {} kubectl delete -n example {}
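The "wait until all of the pods are deleted" step in this procedure can be automated with a simple polling loop. The following is a minimal sketch, assuming pod names contain the pattern passed in, as in the grep pulsar-bookkeeper commands above. The function names count_pods and wait_until_no_pods and the default timings are illustrative.

```shell
# Count pods in a namespace whose name matches a pattern.
# $1 = namespace, $2 = grep pattern
count_pods() {
  kubectl get pods -n "$1" -o name | grep -c "$2"
}

# Poll until no pod in the namespace matches the pattern.
# $1 = namespace
# $2 = grep pattern
# $3 = maximum attempts (assumed default: 60)
# $4 = seconds between attempts (assumed default: 5)
wait_until_no_pods() {
  ns=$1
  pattern=$2
  attempts=${3:-60}
  interval=${4:-5}
  while [ "$attempts" -gt 0 ]; do
    remaining=$(count_pods "$ns" "$pattern")
    if [ "$remaining" -eq 0 ]; then
      return 0
    fi
    echo "$remaining pod(s) matching '$pattern' still present" >&2
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  return 1
}
```

For example, running wait_until_no_pods example pulsar-bookkeeper after scaling the STS to zero returns successfully only once the bookkeeper pods are gone, which makes it safer to chain the PVC deletion after it.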
OutOfDirectMemoryError
The bookkeeper pod can exit due to an OutOfDirectMemoryError, which is typically generated in clusters with high Pulsar throughput and/or slow disk input/output (I/O). For example:
[bookie-io-1-1] ERROR org.apache.bookkeeper.proto.BookieServer - Unable to allocate memory, exiting bookie
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 2919235615, max: 2936012800)
Bookkeeper uses direct memory to cache data before it is written to disk.
To allocate more direct memory based on the expected highest network throughput relative to the disk I/O:
-
Access the bookkeeper.configData.PULSAR_MEM helm parameter.
-
Add -XX:MaxDirectMemorySize with the appropriate value. For example, -XX:MaxDirectMemorySize=4G allocates 4 gigabytes of direct memory.
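To judge how far the limit was exceeded and to build the new parameter value, the numbers in the log line and the flag itself can be handled with two small helpers. This is a sketch under assumptions: the function names are hypothetical, and the PULSAR_MEM value shown in the usage note is an invented example, so start from your actual current setting.

```shell
# Read an OutOfDirectMemoryError log line on stdin and print the requested,
# used, and maximum byte counts plus the bytes that were still free.
direct_memory_headroom() {
  sed -n 's/.*failed to allocate \([0-9][0-9]*\) byte(s) of direct memory (used: \([0-9][0-9]*\), max: \([0-9][0-9]*\)).*/\1 \2 \3/p' \
    | while read -r requested used max; do
        echo "requested=$requested used=$used max=$max free=$((max - used))"
      done
}

# Return a PULSAR_MEM string with -XX:MaxDirectMemorySize set to the given
# size. An existing -XX:MaxDirectMemorySize flag is dropped first so that
# the flag appears only once.
# $1 = current PULSAR_MEM value, $2 = size such as 4G
set_direct_memory() {
  kept=$(printf '%s\n' "$1" | tr -s ' ' '\n' \
    | grep -v '^-XX:MaxDirectMemorySize=' | tr '\n' ' ' \
    | sed 's/^ *//; s/ *$//')
  if [ -n "$kept" ]; then
    printf '%s -XX:MaxDirectMemorySize=%s\n' "$kept" "$2"
  else
    printf '%s%s\n' '-XX:MaxDirectMemorySize=' "$2"
  fi
}
```

Applied to the log line shown above, direct_memory_headroom reports free=16777185, which is just below the 16777216 bytes requested and explains why the allocation failed. With an assumed current value of "-Xms2g -Xmx2g", set_direct_memory "-Xms2g -Xmx2g" 4G prints -Xms2g -Xmx2g -XX:MaxDirectMemorySize=4G.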