Troubleshooting
A brief introduction to troubleshooting steps
Troubleshooting an OTH Platform that is not responding as expected usually boils down to one of three things: an error in the network configuration, DNS lookup issues, or services not running.
Understanding how network traffic flows to your OTH environment is key to troubleshooting it:
- all external network traffic must go through the frontend server and be addressed to the domain name you have set up for your OTH environment, e.g. “mytelehealthsolution.org”.
- the frontend server forwards the network traffic to the app server on port 80/TCP based on a few simple rules.
- a traefik proxy server receives the network traffic from the frontend server and distributes it to the relevant micro-service based on the URL path.
- each micro-service calls the OTH environment, if needed, using the domain name you have set up for your OTH environment.
- the application servers must be able to resolve the domain name for your solution in DNS. It is not enough to configure this in /etc/hosts, because the docker containers will not pick this up.
- most micro-services need access to the RabbitMQ and MySQL endpoints, for internal communication and for storing persistent data.
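The /etc/hosts pitfall described above can be spotted with a small helper. This is a minimal sketch and not part of the OTH tooling: the function name in_hosts_file is hypothetical, and the domain in the usage note is the example domain from this guide.

```shell
# Succeed if <domain> is listed in a hosts file (default: /etc/hosts).
# Comments are stripped first, so commented-out entries do not count.
in_hosts_file() {
  awk -v d="$1" '
    { sub(/#.*/, "") }                                  # drop trailing comments
    { for (i = 2; i <= NF; i++) if ($i == d) found = 1 } # field 1 is the IP
    END { exit !found }
  ' "${2:-/etc/hosts}"
}
```

If in_hosts_file mytelehealthsolution.org succeeds while dig +short mytelehealthsolution.org prints nothing, the name only resolves through the hosts file on the server, and the containers will most likely not be able to resolve it.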
In our experience, 99% of the time that the OTH platform doesn’t start as expected, the cause is network related: services cannot resolve the domain name for the solution in DNS, cannot connect to other services on the required ports, or depend on other services that have not yet started up properly.
A step-by-step guide to troubleshooting the OTH platform
Assumption: We assume that the frontend, application server and storage backend are up and running and accessible by SSH, and that the needed tools such as curl, nc and the mysql client are installed on the servers.
Services running?
Log in to the application server(s) and run sudo docker ps. Expect output similar to this:
$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
43fd311fc5b3 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/clinician:2.82.2_build1 "/init" 4 hours ago Up 4 hours (healthy) 8080/tcp clinician
34cf68e91a08 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/client-citizen:2.80.1_build3-oth "/init" 5 days ago Up 5 days (healthy) 8000/tcp client-citizen
c96a09fa8e12 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/measurements:1.15.1_build1 "/init" 5 days ago Up 5 days (healthy) 8220/tcp measurements
016d30f8baba 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/developer-portal:2.80.0_build7 "/init" 5 days ago Up 5 days (healthy) 8380/tcp developer-portal
98ac24fd707c 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/notifications:1.7.0_build10 "/init" 5 days ago Up 5 days (healthy) 8340/tcp notifications
45a9be60a88e 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/chat:1.3.0_build7 "/init" 5 days ago Up 5 days (healthy) 8400/tcp chat
94bc6db70881 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/questionnaires:1.5.0_build2 "/init" 5 days ago Up 5 days (healthy) 8350/tcp questionnaires
2a4fa6381aa7 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/results:1.2.0_build4 "/init" 5 days ago Up 5 days (healthy) 8430/tcp results
ae64101cce5a 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/ecg:2.2.0_build3 "/init" 5 days ago Up 5 days (healthy) 8150/tcp ecg
36ec400eea36 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/thresholds:1.11.0_build8 "/init" 5 days ago Up 5 days (healthy) 8280/tcp thresholds
ea8110ca4edb 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/guidance:1.7.0_build3 "/init" 5 days ago Up 5 days (healthy) 8320/tcp guidance
737dc49c5312 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/audit:1.4.1_build1 "/init" 5 days ago Up 5 days (healthy) 8390/tcp audit
3e60c4ee28ed 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/organizations:1.2.0_build6 "/init" 5 days ago Up 5 days (healthy) 8420/tcp organizations
7ca6891c046e 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/qnr-ui:1.17.1_build1 "/init" 5 days ago Up 5 days (healthy) 8100/tcp qnrui
03693fd50073 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/logging:1.8.0_build2 "/init" 5 days ago Up 5 days (healthy) 8270/tcp logging
1b1b31a0004f 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/object-storage:1.3.0_build4 "/init" 5 days ago Up 5 days (healthy) 8410/tcp object-storage
b01cac18d525 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/uis:3.8.1_build2 "/init" 5 days ago Up 5 days (healthy) 8250/tcp uis
f14d142aa3eb 401334847138.dkr.ecr.eu-west-1.amazonaws.com/oth/idp2:1.23.2_build1 "/init" 5 days ago Up 5 days (healthy) 8140/tcp idp2
The status field indicates whether the service has started (showing as healthy), is “starting up” or is “unhealthy”. If all services are healthy, move on to the next step. Unhealthy services should be investigated by looking either in the journal log, e.g. sudo journalctl -u clinician, or in the service’s log file, e.g. less /var/log/clinician/clinician.log (or the relevant log file for the given service).
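With many containers it can be easier to list only the ones that need attention. The following is a minimal sketch: the helper name not_healthy is hypothetical, and it assumes the status text contains “(healthy)” for healthy containers, as in the output above.

```shell
# Print the names of containers whose status is not "(healthy)".
# Input: one "<name>|<status>" line per container, as produced by
#   sudo docker ps -a --format '{{.Names}}|{{.Status}}'
not_healthy() {
  awk -F '|' '$2 !~ /\(healthy\)/ { print $1 }'
}
```

Usage: sudo docker ps -a --format '{{.Names}}|{{.Status}}' | not_healthy. The -a flag also includes containers that have exited, which plain docker ps does not show. Each name printed can then be passed to journalctl -u as described above.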
Common issues for services not starting
The container could not be downloaded from OTH’s container registry.
- make sure you can reach OTH’s ECR. The address can be found in the service file for the service, e.g. /etc/systemd/system/clinician
- make sure you are using the correct AWS credentials, provided by OTH technical support.
- if you are using branding, check with OTH support that your container has been built properly.
Service can’t reach backend services
- check that you can reach the rabbitmq, mysql and minio services from the application server. This can be done by issuing the command nc <ip of the storage server> <service port>
- make sure the service is configured with the correct credentials and hostname/IP address to reach the backend services.
- make sure that no local firewall such as iptables or ufw is blocking traffic on the needed ports.
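The nc check above can be repeated for all backend services in one go. This is a sketch under assumptions: check_backends is a hypothetical helper, it assumes your nc supports the -z (scan only) and -w (timeout) options, and the ports in the usage note are the upstream defaults for RabbitMQ, MySQL and MinIO, which may differ in your installation.

```shell
# Report whether each "<name>:<port>" pair is reachable on <host>.
# Usage: check_backends <host> <name>:<port> [<name>:<port> ...]
check_backends() {
  local host=$1
  shift
  local pair name port
  for pair in "$@"; do
    name=${pair%%:*}
    port=${pair##*:}
    # -z: connect without sending data, -w 3: give up after 3 seconds
    if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
      echo "OK   $name ($host:$port)"
    else
      echo "FAIL $name ($host:$port)"
    fi
  done
}
```

Usage, with the IP of your storage server in place of the placeholder: check_backends <ip of the storage server> rabbitmq:5672 mysql:3306 minio:9000. A FAIL line points to a service that is down, a wrong port, or a firewall in between.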
Service can’t reach frontend
- check if you can resolve the domain name of your environment in DNS on the application server, e.g. dig mytelehealth.solution.io. Check that it points to the correct IP of the frontend server.
- check if you can reach the frontend server from the application server with the domain name, e.g. curl -v https://mytelehealth.solution.io.
- make sure that the traefik service is running on the application server.
- make sure that the HAPROXY service is running on the frontend server and is reachable on ports 80/TCP and 443/TCP from the outside.
- make sure HAPROXY is configured with the correct domain name and the correct backend server reference to the application server.
- if using a hostname for the application server, make sure it is resolvable on the frontend server.
- make sure that no local firewall such as iptables or ufw is blocking traffic on the needed ports on the frontend server (allow 80,443/TCP) and the application server (allow 80/TCP from the frontend server IP).
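The DNS check in the first step can be turned into a yes/no test. This is a minimal sketch: resolves_to is a hypothetical helper, and the IP address in the usage note is a placeholder for your frontend server.

```shell
# Succeed if <domain> has an A record equal to <expected_ip>.
# dig queries DNS directly, so entries in /etc/hosts are ignored,
# which matches what the docker containers need.
resolves_to() {
  local domain=$1
  local expected_ip=$2
  dig +short "$domain" A | grep -Fxq -- "$expected_ip"
}
```

Usage: resolves_to mytelehealth.solution.io 203.0.113.10 && echo "DNS OK" || echo "DNS does not point to the frontend". If the DNS check passes but curl -v still fails, continue with the HAPROXY and firewall steps above.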
Service can’t reach other application services
- check if the traefik service is running on the application server.
- make sure that no local firewall such as iptables or ufw is blocking traffic on the needed ports on the application server (allow 80/TCP from the frontend server IP).
- if you have checked the above steps and corrected any errors, you should be able to reach other services now.
Service dependencies
The “idp2” and “audit” services need to be up and running before any other services can start up correctly. Both “idp2” and “audit” need access to the RabbitMQ service and the database service in order to start up.
Common error codes
503:
Service awaits other service(s) to be available:
Typically most services will return 503 if the idp2 service is unavailable. If multiple services are returning 503, check whether the idp2 service is running and get it up and running before troubleshooting the other services any further. Once the idp2 service is up correctly, most 503 alarms should disappear on their own shortly after.
URI endpoint doesn’t exist:
HAPROXY will return a 503 if the URI endpoint is non-existent. Please ensure your URI is correct and that its corresponding service exists, is up and running, and is reachable from the frontend node.
Migration scripts not yet complete:
This usually only happens right after an upgrade to a new release. On large data sets in the database, the migration scripts can take several minutes to finish. The JSON body returned by the clinician health endpoint might indicate that a migration is in progress, so poll the clinician health endpoint (e.g. https://mytelehealth-solution.com/clinician/health) as a first step when troubleshooting.
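Polling the health endpoint can be scripted. The following is a sketch under assumptions: wait_for_health is a hypothetical helper, and it assumes the endpoint answers HTTP 200 once the service is up. The format of the JSON body is not covered here, so inspect it separately with curl -s.

```shell
# Poll <url> until it answers HTTP 200, or give up after <attempts> tries.
# Usage: wait_for_health <url> [attempts, default 30] [delay in seconds, default 10]
wait_for_health() {
  local url=$1
  local attempts=${2:-30}
  local delay=${3:-10}
  local i=1
  local code
  while [ "$i" -le "$attempts" ]; do
    # -w '%{http_code}' prints only the status code; 000 means no response
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url") || code=000
    echo "attempt $i: HTTP $code"
    if [ "$code" = "200" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Usage: wait_for_health https://mytelehealth-solution.com/clinician/health 60 10 polls for up to roughly 10 minutes.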
501:
Service has for an unknown reason ended in a state of internal error:
The most common cause for a service returning 501 is a lack of resources such as memory, disk or CPU. In rare cases the cause is an actual bug in the code or in supporting software components. To investigate further, look at the monitoring metrics for these resources and at the log files for the service to get a clearer view of what happened.
401:
Wrong or expired credentials:
This is commonly experienced when calling services through the REST API. 401 indicates that authentication has failed, typically because of a wrong API token or the use of an expired token.
403:
Missing permissions:
Being authenticated but lacking the proper permissions to access a resource will likely end up in a 403. Check whether the user you are authenticated as has permission to access the resource or endpoint you are trying to reach.
404:
Missing resource:
The resource you requested doesn’t exist or is in the process of starting up. Make sure your request targets a valid endpoint.
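As a quick reference, the error codes above can be condensed into a lookup helper. This is only a sketch that restates the hints from this section; explain_status is a hypothetical name.

```shell
# Print a short troubleshooting hint for an HTTP status code,
# summarising the error codes described in this guide.
explain_status() {
  case "$1" in
    503) echo "service waiting on a dependency (check idp2 first), unknown URI, or migration in progress" ;;
    501) echo "internal error: check memory, disk, CPU and the service log" ;;
    401) echo "wrong or expired credentials or API token" ;;
    403) echo "authenticated, but missing permissions" ;;
    404) echo "resource missing or service still starting up" ;;
    *)   echo "not covered in this guide" ;;
  esac
}
```

Usage: explain_status "$(curl -s -o /dev/null -w '%{http_code}' https://mytelehealth-solution.com/clinician/health)".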
Database lock (clinician):
If you see an “ERROR context.GrailsContextLoaderListener - Error initializing the application: Could not acquire change log lock.” error in /var/log/clinician/clinician.log, perform the following steps to remedy it:
- connect to the database
- run the following commands, each followed by enter:
use clinician;
# NOTE! If you are using legacy naming, use the dbname from clinician:config:db:name: in your settings.yml instead of clinician.
update DATABASECHANGELOGLOCK SET LOCKED=0,LOCKGRANTED=NULL,LOCKEDBY=NULL WHERE ID=1;
quit
- wait for the clinician service to start its run (do NOT restart the clinician service or the server). This can take up to 10 minutes.
- if it still fails, contact support for further assistance.
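Before clearing the lock, it is worth confirming that the log really contains this error. A minimal sketch, with has_changelog_lock_error as a hypothetical helper name and the log path taken from this guide as the default:

```shell
# Succeed if the given log file (default: the clinician log) contains
# the "Could not acquire change log lock" error.
has_changelog_lock_error() {
  grep -Fq 'Could not acquire change log lock' "${1:-/var/log/clinician/clinician.log}"
}
```

Usage: has_changelog_lock_error && echo "lock error found". The current state of the lock can also be inspected in the mysql client with select * from DATABASECHANGELOGLOCK; before running the update.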
Certificates
Common reasons for SSL certificates not working:
- Certificate has expired
- Missing CA and Intermediate Certificates in your configuration
- Revoked Certificate
- Issued to a mismatching domain name
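The expiry and domain name causes above can be checked with openssl. This is a sketch under assumptions: the helper names cert_valid_for and cert_summary are hypothetical, and the certificate is expected as a PEM file whose location depends on your installation.

```shell
# Succeed if the PEM certificate in <file> is still valid <seconds> from now.
cert_valid_for() {
  openssl x509 -in "$1" -noout -checkend "$2" >/dev/null
}

# Print the subject, issuer and expiry date of a PEM certificate.
cert_summary() {
  openssl x509 -in "$1" -noout -subject -issuer -enddate
}
```

Usage: cert_valid_for cert.pem 2592000 || echo "certificate expires within 30 days". To look at the certificate actually served by the frontend, openssl s_client -connect mytelehealth.solution.io:443 -servername mytelehealth.solution.io </dev/null 2>/dev/null | openssl x509 -noout -subject -enddate prints its subject and expiry date.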
Users can’t log in
System related causes for a user not being able to log in:
idp2 and/or audit service is not available:
Make sure that the idp2 and the audit services are running correctly.
clinician or client-citizen service is not fully up:
- If admin or clinician users in general can’t log in, the clinician service might not be fully up.
- If patients in general can’t log in, it indicates that the client-citizen service is not fully up.
Database unavailable:
Users can be presented with the login screen even though the clinician and client-citizen services can’t reach the backend database endpoint. Please ensure the database is available and reachable from the services.
User related causes:
- user account doesn’t exist.
- user credentials are wrong.
- user account is locked.
User related causes need to be resolved by a platform administrator or clinician user, which is outside the scope of this manual. Please refer to the manuals for the platform.
Reporting a bug
Submitting a bug to OTH tech-support
When reporting a bug, specify the following in your request so that we can properly understand and reproduce the bug you are experiencing:
- The environment on which the error occurred, i.e. the URL of the server.
- Description of the expected result and the actual result.
- A step-by-step guide on how to reproduce the bug, if possible.
- Include a screenshot of the error message, if relevant.
- If the bug relates to a Bluetooth integration then include the model of the Android/iOS device + OS version and a picture of the Bluetooth device’s information, usually found on the back of the device.
- If the bug occurs when trying to complete a questionnaire in the patient app please include the JSON export of the questionnaire containing the error, as an attachment to the email.
To report a bug, submit a request through https://opentelehealth.zendesk.com
Contacting support
Please direct any questions or issues related to an OTH installation to OTH technical support.
- Provide the URL of the server to which the issue pertains.
- Provide as detailed a description as possible of the nature of the issue.
- Please be mindful of deadlines/execution times concerning branding, upgrades etc.
- For larger issues such as migrations and projects, please contact your key-account manager/project manager for initial planning.
OTH support can be reached through https://opentelehealth.zendesk.com