Deployment: Current state, plans for next 6 months, issues raised.

From ReddNet
Revision as of 08:36, 21 April 2008 by Sheldonp (talk | contribs)
  • sites are up, with these exceptions:
    • umich - 4 sites up, 8 not. need to coordinate with a local person; waiting on that now. they are having switching problems and dying hardware (mis-configured?).
    • need to request an IT contact
    • had to buy a new KVM, but the KVM turned out not to be the problem; it was a mis-configured switch there. the extra KVM is at Vandy, so we have a spare.
  • need a technical contact/site user level agreement for what is required.
  • UFL and UMich are the only places without an IT contact, and this may be what we have at these places.
  • deploying latest depot code on all depots as we speak.
  • Oak Ridge - not brought up yet.
    • productive meeting end of March
    • we will be put outside ORNL's perimeter
    • same area that TeraGrid hardware is
    • so we could be part of TeraGrid...
    • Dan has some ideas for tests to run across the TeraGrid network
    • and John will integrate with his activities
    • communication between David Giles (at ORNL) and Bobby to get those boxes online. John is ultimately responsible for anything that happens but Bobby and David have a plan that he accepts.
    • David will have an account with sudo permissions on the depot hardware; Bobby will start on this as soon as Michigan is done
  • New sites?
    • First one below is solid, second has been offered, the rest...???
    • UTK - Gerald R.
    • LoC - if they want...
    • AmericaView sites? Wisconsin? Alaska?
    • Meeting of the executive committee?
    • Further depot deployments should be driven by applications...
    • OSG?
    • SuraGrid
  • deployment/upgrade experience from recent upgrades, plans for future
    • management node, with an implementation of pexec for deploying software in parallel.
    • run p commands to push out software updates - os update, depot updates.
    • operating system is on a USB key on depots, which means that the OS is separate from the data drives, which makes M&O easy.
    • need to write up a wiki page documenting how things are configured, so that there can be uniformity for others who might want to opt in. (???) describe the main issue (that the OS drive should be separate from the data drives), other philosophies in use, etc.
    • procedure for upgrading depot
      • shutdown the depot code
      • copy over the new code
      • start back up
      • can all be done in only a short time, with no loss of data.
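
The three-step upgrade procedure above, pushed to many depots in parallel from a management node, could be sketched roughly as follows. This is only an illustration, not the pexec implementation mentioned in the notes: the service commands, the remote directory, and all function names are hypothetical, and it assumes ssh and rsync access from the management node to each depot.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

# Hypothetical values: none of these commands or paths come from the notes.
STOP_CMD = "/etc/init.d/depot stop"
START_CMD = "/etc/init.d/depot start"
REMOTE_DIR = "/opt/depot/"


def upgrade_steps(host, local_build):
    """Return the three upgrade commands for one depot, in order."""
    return [
        ["ssh", host, STOP_CMD],                                # shut down the depot code
        ["rsync", "-a", local_build, f"{host}:{REMOTE_DIR}"],   # copy over the new code
        ["ssh", host, START_CMD],                               # start back up
    ]


def run_command(argv):
    """Run one command locally and report whether it exited with status 0."""
    return subprocess.run(argv).returncode == 0


def upgrade_depot(host, local_build, runner=run_command):
    """Run the steps in order, stopping at the first failure.

    Note: if the copy fails, the depot is left stopped so that an
    operator can look at it before anything is restarted.
    """
    for argv in upgrade_steps(host, local_build):
        if not runner(argv):
            return host, False
    return host, True


def upgrade_all(hosts, local_build, runner=run_command, workers=8):
    """Upgrade many depots in parallel; return {host: succeeded}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda h: upgrade_depot(h, local_build, runner), hosts)
        return dict(results)
```

The `runner` argument exists so the ordering and failure handling can be exercised with a stand-in function before anything touches a real depot. Only the depot code directory is copied, which fits the note that the data drives are separate and untouched by an upgrade.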
  • monitoring, user info,...
    • nagios will come back online once the umich site is back online
    • nagios is not required to run on each of the boxes, but can collect info from many nodes.
    • nagios is designed to work across the wide area.
    • helps catch problems early
    • visualization - use part of what we had from SC07 on main reddnet page to allow users to get status of a particular site, see location of sites, show data movement.
    • "Is the depot up and responding" health test à la OSG?
    • want some sort of weathermap or OSG style site status
    • if site is failing, it is taken "offline"
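
A minimal sketch of the "is the depot up and responding" test, assuming that a successful TCP connection to the depot's service port counts as "responding". The function names are hypothetical, the port must be supplied by the caller, and the rule for taking a site "offline" (no depot at the site responds) is an assumption, not a policy stated in the notes.

```python
import socket


def depot_responds(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def site_status(depots, probe=depot_responds):
    """Classify a site from its list of (host, port) depots.

    Assumed policy: the site is "online" if at least one depot responds,
    otherwise "offline".
    """
    if any(probe(host, port) for host, port in depots):
        return "online"
    return "offline"
```

A connection test like this only shows that something is listening. A fuller health test would also issue a real depot request, and a monitoring system such as nagios could run either form on a schedule from a central node.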