Monitoring: Roles & Responsibilities
The Monitoring staff is responsible for implementing the product, applying updates and upgrades, managing the central monitoring service, and providing monitoring best practices to the software administrators who run services in their areas.
Monitoring staff provides support to customers.
Support requests should be emailed to firstname.lastname@example.org.
Requests will receive a response within two hours during normal business hours (Monday through Friday, 7 a.m. to 5 p.m.) and on a best-effort basis outside business hours. Emergency changes will be implemented on a best-effort basis.
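The response window above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the target is taken to be two clock hours after receipt with no carry-over into the next business day, holidays are ignored, and the timestamp is assumed to already be in local time. The names `is_business_hours` and `response_target` are hypothetical and not part of any monitoring product.

```python
from datetime import datetime, time, timedelta
from typing import Optional

BUSINESS_START = time(7, 0)   # 7 a.m.
BUSINESS_END = time(17, 0)    # 5 p.m.
RESPONSE_WINDOW = timedelta(hours=2)


def is_business_hours(received: datetime) -> bool:
    """True if the timestamp falls Monday-Friday between 7 a.m. and 5 p.m."""
    is_weekday = received.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and BUSINESS_START <= received.time() < BUSINESS_END


def response_target(received: datetime) -> Optional[datetime]:
    """Return the latest expected response time, or None for best effort.

    Assumption: the target is simply two clock hours after receipt; the
    policy text does not say how a late-afternoon request is handled.
    """
    if is_business_hours(received):
        return received + RESPONSE_WINDOW
    return None
```

A request received on a weekday morning yields a concrete target time, while a weekend or evening request yields `None`, which stands for best-effort handling.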
Monitoring staff will work with the OIT Linux team, system DBAs, and software administrators to provide the monitoring service. Configuration and maintenance of the service will require collaboration between Monitoring staff and the teams responsible for the monitored targets.
|Task||Responsible party|
|Monitoring server installation and configuration||Monitoring staff|
|Monitoring server maintenance||Monitoring staff|
|Monitoring agent installation and configuration||OIT Sysadmin Team|
|Firewall configuration for agents||OIT Sysadmin Team|
|PCI monitoring agent configuration||OIT Sysadmin Team|
|Monitoring target configuration, maintenance, and troubleshooting||Customer|
|Monitoring template standardization||Customer|
|Monitoring template maintenance||Customer|
|Monitoring account creation||Monitoring staff|
|Monitoring account configuration||Customer|
|Email and pager configuration||Customer|
|Monitoring server troubleshooting||Monitoring staff|
|Monitoring agent troubleshooting||OIT Sysadmin Team and Monitoring staff|
|Service downtime||Monitoring staff|
In the event of a disaster, monitoring services will be recovered in a manner consistent with the scope of the disaster. In the event of a full Data Center outage, Monitoring services may be subject to a “first off, last on” strategy. Redundancies are built into the infrastructure to minimize outages and to ensure that service is restored as quickly as possible in the event of a disaster.
Degraded or failed service will receive the attention and resources necessary to achieve a recovery time objective (RTO) of multiple hours.
In the event of an unexpected service interruption, OIT will update the System Status and send notification of the interruption to individuals subscribed to the email@example.com list within 15 minutes of the Monitoring staff identifying a service loss. Status updates will be provided hourly to both System Status and the list. Postmortem results will be released after the resolution of major interruptions.
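The notification timing can be illustrated with a short Python sketch. It assumes that hourly updates are counted from the time the first notice was actually sent, which the policy does not specify, and the function names `initial_notice_deadline` and `update_times` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List

INITIAL_NOTICE_WINDOW = timedelta(minutes=15)
UPDATE_INTERVAL = timedelta(hours=1)


def initial_notice_deadline(identified: datetime) -> datetime:
    """Latest time for the first notice after a service loss is identified."""
    return identified + INITIAL_NOTICE_WINDOW


def update_times(first_notice: datetime, resolved: datetime) -> List[datetime]:
    """Hourly update times between the first notice and resolution.

    Assumption: the hourly clock starts at the first notice, not at the
    moment the service loss was identified.
    """
    times = []
    next_update = first_notice + UPDATE_INTERVAL
    while next_update < resolved:
        times.append(next_update)
        next_update += UPDATE_INTERVAL
    return times
```

For an interruption identified at 10:00, the first notice would be due by 10:15; if that notice went out at 10:10 and service was restored at 13:00, updates would fall at 11:10 and 12:10.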