Monitor

Code and View updates for sites/index.php

This work is in preparation for asking sites to update their PCU information for Monitor.

I've created a better view for sites/index.php. It lists all nodes and allows techs or PIs to associate nodes with a PCU.

Additionally, the code is now split more clearly between a template (standard HTML with minimal PHP variables and loops) and a control function.

I believe this will make it easier to integrate additional AJAX code later. I will ask OneLab for comments on the code reorganization, using sites/index.php as an example.

 

Monitor Suspended

Monitor was suspended this week due to operational setbacks. At the beginning of the week the database server was under extreme load, resulting in timeouts or failed boots at remote sites and complicating support for Monitor tickets. Rather than generate additional traffic, I postponed running Monitor again.

The mismatch in the PlanetLab-branch.tar.gz bundle that is downloaded during a 'rins' was fixed for I2 nodes (of which there are many), so that 'codemux' is correctly installed after a rins. It's not clear how this happened, other than a stale version of PlanetLab-branch.tar.gz that was never updated to the 4.1 version.

Monitor Stats Overview

I let Monitor loose on over 200 nodes this morning. Nodes in some states are not acted on: debug, down < 7 days, with a PCU.
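As a rough illustration of that exemption rule, here is a minimal sketch. It reads the list above as three independent conditions, which is an assumption, and the `Node` fields, state names, and function names are hypothetical rather than taken from the Monitor code.

```python
from dataclasses import dataclass

GRACE_DAYS = 7  # from the text: nodes down fewer than 7 days are left alone


@dataclass
class Node:
    hostname: str
    state: str       # hypothetical values: "boot", "debug", "down"
    days_down: int   # days since the node was last seen up
    has_pcu: bool    # True if a PCU is associated with the node


def is_exempt(node: Node) -> bool:
    """Return True if no action should be taken on this node.

    Assumption: each of the three conditions exempts a node on its own.
    """
    if node.state == "debug":
        return True
    if node.state == "down" and node.days_down < GRACE_DAYS:
        return True
    if node.has_pcu:
        return True
    return False


def actionable(nodes):
    """Keep only the nodes that are not exempt."""
    return [n for n in nodes if not is_exempt(n)]
```

If the intended reading is instead "down < 7 days and with a PCU", the last two checks would be combined into a single condition.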

There were 60 sites with unanswered tickets before, and now there are 97 tickets in RT actively managed by Monitor, each representing outstanding issues with a site, not just a node.

sites_today: 52
nodes_today: 100

sites_total: 97
nodes_total: 177
nodes restored since beginning: 41

Site Assist and Monitor

I've added Site Assistant docs that describe some of the features and policy currently implemented in Monitor, as well as what is planned for Monitor in the coming days.

July 16-20, 2007

  • Second run of Monitor for a subset of 80 down nodes.
    • 13 sites/22 nodes notified that their nodes are down.
    • 9 sites/14 nodes notified & squeezed due to non-response about BootCDs.
    • 77 sites/134 nodes (45 up/89 down) are exempted from squeezing due to META-Site status.
  • Investigating stricter categorization based on Comon data
  • Further development for automatic squeeze-backoff, RT resolution, and web interface.

July 2-6, 2007

  • Prepare for ROADs
  • First run of Monitor where the email to Techs actually works...
    • Messages sent to techs are passed through RT, keeping a record and catching bounces, autoreplies, or away messages.
    • RT Queue is not visible to most others on RT for now. Don't want to junk up their summary page.
    • Mailing list for all Monitor Queue traffic is at Monitor.
    • The first round only checked for old BootCDs on nodes that are in debug mode or actually running.

June 29, 2007

  • Updated TechGuide for BootCD preparation. This will hopefully stave off the wave of support requests that will be generated by the V2.0 upgrade messages for down nodes by Monitor.
  • Discovered the tech- alias bug in gen_aliases.py. This evidently prevented my first round of messages from going out, meaning that I'll be starting from zero on Monday, July 2nd. This should get everyone's attention right away.
  • Started a discussion on PLDevel for a replacement of the old admGetConfFile() API and what belongs in the API vs what doesn't.

Faiyaz pointed out that the NotifyPersons() API call may be better suited to sending email to Techs/PIs etc. than the email aliases anyway. I will look into that after the first round.

June 15, 2007

Monitor

I've added a 'monitor' queue to the RT support list in order to catch bounced messages and serve as a central list for correspondence with site managers. This is currently only visible to me in the RT GUI.

Finishing the monitor code for debug and down sites. The processing moved from node-based to site-based, grouping all complaints into a single email in order to be more friendly. I've added blacklists to prevent certain sites from ever being considered, regardless of whether there are RT tickets for them or not. The code also differentiates clearly between the 'diagnosis' of a node's problem and the 'action' taken on it.
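The site-based flow described above might be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Monitor code: the node dictionaries, the `SITE_BLACKLIST` contents, and the function names `diagnose`, `group_by_site`, and `act` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical blacklist: sites never considered, whatever their ticket status.
SITE_BLACKLIST = {"examplesite"}


def diagnose(node):
    """Diagnosis only: name the node's problem, take no action."""
    if node["state"] == "debug":
        return "node is in debug mode"
    if node["state"] == "down":
        return "node is down"
    return None  # nothing to report


def group_by_site(nodes):
    """Collect diagnoses per site, skipping blacklisted sites."""
    complaints = defaultdict(list)
    for node in nodes:
        if node["site"] in SITE_BLACKLIST:
            continue
        problem = diagnose(node)
        if problem is not None:
            complaints[node["site"]].append((node["hostname"], problem))
    return dict(complaints)


def act(complaints):
    """Action only: build one message body per site from its diagnoses."""
    messages = {}
    for site, items in complaints.items():
        lines = [f"{host}: {problem}" for host, problem in items]
        messages[site] = "\n".join(lines)
    return messages
```

Keeping `diagnose` free of side effects means the same diagnosis can later feed a different action (a notice, a squeeze, or nothing at all) without touching the detection logic.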
