Troubleshooting

A brief introduction to troubleshooting steps

Troubleshooting an OTH platform that is not responding as expected usually comes down to one of three things: an error in the network configuration, DNS lookup issues, or services that are not running.

Understanding how network traffic flows to your OTH environment is key to troubleshooting it:

all external network traffic must go through the frontend server and must be addressed to the domain name you have set up for your OTH environment, e.g. “mytelehealthsolution.org”.
the frontend server forwards the network traffic to the app server on port 80/TCP based on a few simple rules.
a traefik proxy server receives the network traffic from the frontend server and distributes it to the relevant micro-service based on the URL path.
each micro-service calls the OTH environment, when needed, using the domain name you have set up for your OTH environment.
the application servers must be able to resolve the domain name for your solution in DNS. It is not enough to configure this in /etc/hosts, because the docker containers will not pick this up.
most micro-services need access to the RabbitMQ and MySQL endpoints for internal communication and for storing persistent data.

In our experience, 99% of the time that the OTH platform doesn’t start as expected, the cause is a network-related issue: services that cannot resolve the solution’s domain name in DNS, services that cannot connect to other services on the required ports, or other services that have not yet started up properly.

A step-by-step guide to troubleshooting the OTH platform

Assumption: the frontend server, application server and storage backend are up and running and accessible by SSH, and the required tools such as curl, nc and the mysql client are installed on the servers.

Services running?

$ sudo docker ps
CONTAINER ID   IMAGE                                                                               COMMAND   CREATED       STATUS                 PORTS      NAMES
43fd311fc5b3   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/clinician:2.82.2_build1            "/init"   4 hours ago   Up 4 hours (healthy)   8080/tcp   clinician
34cf68e91a08   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/client-citizen:2.80.1_build3-oth   "/init"   5 days ago    Up 5 days (healthy)    8000/tcp   client-citizen
c96a09fa8e12   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/measurements:1.15.1_build1         "/init"   5 days ago    Up 5 days (healthy)    8220/tcp   measurements
016d30f8baba   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/developer-portal:2.80.0_build7     "/init"   5 days ago    Up 5 days (healthy)    8380/tcp   developer-portal
98ac24fd707c   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/notifications:1.7.0_build10        "/init"   5 days ago    Up 5 days (healthy)    8340/tcp   notifications
45a9be60a88e   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/chat:1.3.0_build7                  "/init"   5 days ago    Up 5 days (healthy)    8400/tcp   chat
94bc6db70881   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/questionnaires:1.5.0_build2        "/init"   5 days ago    Up 5 days (healthy)    8350/tcp   questionnaires
2a4fa6381aa7   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/results:1.2.0_build4               "/init"   5 days ago    Up 5 days (healthy)    8430/tcp   results
ae64101cce5a   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/ecg:2.2.0_build3                   "/init"   5 days ago    Up 5 days (healthy)    8150/tcp   ecg
36ec400eea36   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/thresholds:1.11.0_build8           "/init"   5 days ago    Up 5 days (healthy)    8280/tcp   thresholds
ea8110ca4edb   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/guidance:1.7.0_build3              "/init"   5 days ago    Up 5 days (healthy)    8320/tcp   guidance
737dc49c5312   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/audit:1.4.1_build1                 "/init"   5 days ago    Up 5 days (healthy)    8390/tcp   audit
3e60c4ee28ed   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/organizations:1.2.0_build6         "/init"   5 days ago    Up 5 days (healthy)    8420/tcp   organizations
7ca6891c046e   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/qnr-ui:1.17.1_build1               "/init"   5 days ago    Up 5 days (healthy)    8100/tcp   qnrui
03693fd50073   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/logging:1.8.0_build2               "/init"   5 days ago    Up 5 days (healthy)    8270/tcp   logging
1b1b31a0004f   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/object-storage:1.3.0_build4        "/init"   5 days ago    Up 5 days (healthy)    8410/tcp   object-storage
b01cac18d525   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/uis:3.8.1_build2                   "/init"   5 days ago    Up 5 days (healthy)    8250/tcp   uis
f14d142aa3eb   401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/idp2:1.23.2_build1                 "/init"   5 days ago    Up 5 days (healthy)    8140/tcp   idp2

The STATUS column indicates whether the service has started (shown as “healthy”), is still starting up, or is “unhealthy”. If all services are healthy, move on to the next step. Investigate unhealthy services by looking either in the journal, e.g. sudo journalctl -u clinician, or in the service’s log file, e.g. less /var/log/clinician/clinician.log (or the relevant log file for the given service).
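
On a host with many containers it can help to list only the services that are not reported as healthy. The following is a minimal sketch, assuming the STATUS text looks like the docker ps output above. The helper name not_healthy is made up for this example and is not part of the OTH tooling.

```shell
#!/bin/sh
# Hypothetical helper: reads "name<TAB>status" lines on stdin and prints
# every container whose status does not contain "(healthy)".
not_healthy() {
  awk -F '\t' '$2 !~ /\(healthy\)/ { print $1 ": " $2 }'
}

# Usage on the application server (requires docker, so it is left commented out):
#   sudo docker ps -a --format '{{.Names}}\t{{.Status}}' | not_healthy
```

An empty result means every listed container reports itself as healthy. Anything that is printed is a candidate for the journal and log file checks described above.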

Common issues for services not starting

The container could not be downloaded from OTH’s container registry.

make sure you can reach OTH’s ECR. The address can be found in the service file for the service, e.g. /etc/systemd/system/clinician
make sure you are using the correct AWS credentials, as provided by OTH technical support.
if you are using branding, check with OTH support that your container has been built properly.

Service can’t reach backend services

check that you can reach the rabbitmq, mysql and minio services from the application server. This can be done with the command nc <ip of the storage server> <service port>
make sure the service is configured with the correct credentials and hostname/IP address to reach the backend services.
make sure that no local firewall, such as iptables or ufw, is blocking traffic on the required ports.
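
The nc check above can be repeated for each backend port in one go. This is a sketch only: the function names are made up for this example, the ports 5672 (RabbitMQ), 3306 (MySQL) and 9000 (MinIO) are common defaults rather than values confirmed for your installation, and the -z and -w options assume an nc variant that supports them.

```shell
#!/bin/sh
# Hypothetical helper: report whether a TCP port on a host accepts connections.
check_port() {
  host=$1
  port=$2
  if nc -z -w 3 "$host" "$port" 2>/dev/null; then
    echo "$host:$port reachable"
  else
    echo "$host:$port NOT reachable"
    return 1
  fi
}

# Hypothetical helper: check every backend port and fail if any is unreachable.
check_backend() {
  backend=$1
  rc=0
  # Assumed default ports: RabbitMQ, MySQL, MinIO. Adjust to your configuration.
  for p in 5672 3306 9000; do
    check_port "$backend" "$p" || rc=1
  done
  return $rc
}

# Usage on the application server (the address is a placeholder):
#   check_backend 192.0.2.20
```

A port reported as NOT reachable points either at the backend service itself or at a firewall between the two servers.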

Service can’t reach frontend

check that you can resolve the domain name of your environment in DNS on the application server, e.g. dig mytelehealth.solution.io, and that it points to the correct IP of the frontend server.
check that you can reach the frontend server from the application server using the domain name, e.g. curl -v https://mytelehealth.solution.io.
make sure that the traefik service is running on the application server.
make sure that the HAProxy service is running on the frontend server and is reachable on ports 80/TCP and 443/TCP from the outside.
make sure that HAProxy is configured with the correct domain name and the correct backend server reference to the application server.
if you use a hostname for the application server, make sure it is resolvable on the frontend server.
make sure that no local firewall, such as iptables or ufw, is blocking traffic on the required ports on the frontend server (allow 80,443/TCP) and the application server (allow 80/TCP from the frontend server IP).
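
The DNS check can be scripted by comparing what dig returns with the IP you expect for the frontend server. dig queries DNS directly and does not read /etc/hosts, which matches the requirement that the name must resolve in DNS. The helper name resolves_to and the IP address are placeholders made up for this sketch.

```shell
#!/bin/sh
# Hypothetical helper: reads resolved IP addresses on stdin (one per line)
# and succeeds only if one of them equals the expected frontend IP in $1.
resolves_to() {
  grep -qxF "$1"
}

# Usage on the application server (requires dig; the IP is a placeholder):
#   if dig +short mytelehealth.solution.io | resolves_to 203.0.113.10; then
#     echo "DNS points at the frontend server"
#   else
#     echo "DNS is missing or points somewhere else"
#   fi
```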

Service can’t reach other application services

check that the traefik service is running on the application server.
make sure that no local firewall, such as iptables or ufw, is blocking traffic on the required ports on the application server (allow 80/TCP from the frontend server IP).
once you have worked through the steps above and corrected any errors, the service should be able to reach the other services.

Service dependencies

The “idp2” and “audit” services must be up and running before any other service can start up correctly. Both “idp2” and “audit” need access to the RabbitMQ service and the database service in order to start up.

Common error codes

503 :

Service awaits other service(s) to be available:

Most services will typically return 503 if the idp2 service is unavailable. If multiple services are returning 503, check whether the idp2 service is running properly and, if it is not, get it up and running before troubleshooting the other services. Once the idp2 service is up, most 503 alarms should disappear on their own shortly afterwards.

URI endpoint doesn’t exist:

HAProxy will return a 503 if the URI endpoint does not exist. Please ensure that your URI is correct and that its corresponding service exists, is up and running, and is reachable from the frontend node.

Migration scripts not yet complete:

This usually only happens right after an upgrade to a new release. On large data sets, the migration scripts can take several minutes to finish. The JSON body returned by the clinician health endpoint might indicate that a migration is in progress, so poll the clinician health endpoint (e.g. https://mytelehealth-solution.com/clinician/health) as a first step when troubleshooting.
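
Polling the health endpoint can be automated with a small loop. This sketch assumes that the endpoint answers with HTTP 200 once the service is fully up, which is an assumption and not something this manual states. The function names are made up for this example, and the JSON body is not inspected.

```shell
#!/bin/sh
# Hypothetical helper: print the HTTP status code for a URL ("000" if the request fails).
http_status() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# Hypothetical helper: run a status command until it prints 200.
# $1 = maximum number of attempts, $2 = seconds between attempts,
# remaining arguments = the command to run.
poll_until_200() {
  tries=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$tries" ]; do
    code=$("$@") || true
    echo "attempt $i: HTTP $code"
    if [ "$code" = "200" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage: up to 30 attempts, 20 seconds apart (about 10 minutes in total):
#   poll_until_200 30 20 http_status https://mytelehealth-solution.com/clinician/health
```

If the loop ends without a 200, look at the response body with curl and at the clinician log file for migration messages before taking further action.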

501 :

Service has, for an unknown reason, ended up in a state of internal error:

The most common cause of a service returning 501 is a lack of resources such as memory, disk or CPU. In rare cases the cause is an actual bug in the code or in supporting software components. To investigate further, look at the monitoring metrics for these resources and at the log files for the service to get a clearer view of what happened.

401 :

Wrong or expired credentials:

This is commonly experienced when calling services through the REST API. A 401 indicates that authentication has failed, typically because of a wrong API token or an expired token.

403 :

Missing permissions:

Being authenticated but lacking the proper permissions to access a resource will likely result in a 403. Check that the user you are authenticated as has permission to access the resource or endpoint you are trying to reach.

404 :

Missing resource:

The resource you requested doesn’t exist or is in the process of starting up. Make sure your request targets a valid endpoint.
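
The error codes above can be condensed into a small lookup that suggests where to look first. This is only an illustrative sketch: the function name first_check is made up, and the messages simply restate the guidance in this section.

```shell
#!/bin/sh
# Hypothetical helper: map an HTTP status code to the first check suggested above.
first_check() {
  case "$1" in
    503) echo "check idp2, the URI, and whether migrations are still running" ;;
    501) echo "check memory, disk and CPU, then the service log" ;;
    401) echo "check that the API token is correct and not expired" ;;
    403) echo "check the permissions of the authenticated user" ;;
    404) echo "check the endpoint and whether the service is still starting" ;;
    *)   echo "no specific guidance for status $1" ;;
  esac
}

# Usage (requires curl and network access, so it is left commented out):
#   first_check "$(curl -s -o /dev/null -w '%{http_code}' https://mytelehealth-solution.com/clinician/health)"
```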

Database lock (clinician):

If you see an “ERROR context.GrailsContextLoaderListener - Error initializing the application: Could not acquire change log lock.” error in /var/log/clinician/clinician.log, remedy it by performing the following steps:

connect to the database
run the following commands, each followed by Enter
- use clinician; # NOTE! If you are using legacy naming, use the dbname from clinician:config:db:name: in your settings.yml instead of clinician.
- update DATABASECHANGELOGLOCK SET LOCKED=0,LOCKGRANTED=NULL,LOCKEDBY=NULL WHERE ID=1;
- quit
wait for the clinician service to start its run (do NOT restart the clinician service or the server). This can take up to 10 minutes.
if it still fails, contact support for further assistance.
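
Before clearing the lock it can be useful to see whether it is actually held, and by whom. The sketch below prints a read-only query against the same table and columns that the update statement above uses. The function name and the DB_HOST and DB_USER placeholders are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print SQL that shows the current state of the change log lock.
# $1 is the database name (defaults to clinician; use your legacy dbname if applicable).
lock_status_sql() {
  printf 'USE `%s`;\n' "${1:-clinician}"
  printf 'SELECT ID, LOCKED, LOCKGRANTED, LOCKEDBY FROM DATABASECHANGELOGLOCK;\n'
}

# Usage (DB_HOST and DB_USER are placeholders; mysql prompts for the password):
#   lock_status_sql | mysql -h DB_HOST -u DB_USER -p
```

This query does not change anything, so it is safe to run while the clinician service is starting.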

Certificates

Common reasons for SSL certificates not working:

Certificate has expired
Missing CA and Intermediate Certificates in your configuration
Revoked Certificate
Issued to a mismatching domain name
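
The first cause, an expired or soon-to-expire certificate, can be checked with openssl. This sketch covers only expiry, not missing intermediates, revocation or name mismatches. The function name is made up for this example, and the domain is the example domain used earlier in this manual.

```shell
#!/bin/sh
# Hypothetical helper: reads a PEM certificate on stdin and succeeds only if
# it remains valid for at least $1 more days.
cert_valid_for_days() {
  openssl x509 -noout -checkend "$(( $1 * 86400 ))" >/dev/null
}

# Usage against the frontend server (requires network access, so it is commented out):
#   openssl s_client -connect mytelehealth.solution.io:443 \
#     -servername mytelehealth.solution.io </dev/null 2>/dev/null \
#     | cert_valid_for_days 14 \
#     || echo "certificate expires within 14 days or could not be read"
```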

Common reasons for users not being able to log in:

idp2 and/or audit service is not available:

Make sure that the idp2 and audit services are running correctly.

clinician or client-citizen service is not fully up:

If admin or clinician users in general can’t log in, the clinician service might not be fully up.
If patients in general can’t log in, the client-citizen service might not be fully up.

Database unavailable:

Users can be presented with the login screen even though the clinician and client-citizen services can’t reach the back-end database endpoint. Please ensure the database is available and reachable from the services.

User-related causes:

user account doesn’t exist.
user credentials are wrong.
user account is locked.

User-related causes need to be resolved by a platform administrator or clinician user, which is outside the scope of this manual. Please refer to the manuals for the platform.

Reporting a bug

Submitting a bug to OTH tech-support

When reporting a bug, specify the following in your request so that we can properly understand and reproduce the bug you are experiencing:

The environment on which the error occurred, i.e. the URL of the server.
Description of the expected result and the actual result.
A step-by-step guide on how to reproduce the bug, if possible.
Include a screenshot of the error message, if relevant.
If the bug relates to a Bluetooth integration, include the model of the Android/iOS device and its OS version, plus a picture of the Bluetooth device’s information, usually found on the back of the device.
If the bug occurs when trying to complete a questionnaire in the patient app, please include the JSON export of the questionnaire containing the error as an attachment to the email.

To report a bug, submit a request through https://opentelehealth.zendesk.com

Contacting support

Please direct any questions or issues related to an OTH installation to OTH technical support.

Provide the URL of the server to which the issue pertains.
Provide as detailed a description as possible of the nature of the issue.
Please be mindful of deadlines/execution times concerning branding, upgrades etc.
For larger issues such as migrations or projects, please contact your key-account manager/project manager for initial planning.

OTH support can be reached through https://opentelehealth.zendesk.com