Deployment: Current state, plans for next 6 months, issues raised.
- All sites are up except:
- UMich - 4 sites up, 8 not. Need to coordinate with a local person; waiting on that now. They are having switching problems and dying hardware (mis-configured?).
- need to request an IT contact
- had to buy a new KVM, but the KVM turned out not to be the problem; the cause was a mis-configured switch there. The extra KVM is at Vandy, so we have a spare.
- need a technical contact/site user level agreement for what is required.
- UFL and UMich are the only places without an IT contact, and this may be what we have at these places.
- deploying latest depot code on all depots as we speak.
- Oak Ridge - not brought up yet.
- productive meeting end of March
- we will be put outside ORNL's perimeter
- the same area where the TeraGrid hardware is
- so we could be part of TeraGrid...
- Dan has some ideas for tests to run across the TeraGrid network
- and John will integrate with his activities
- communication between David Giles (at ORNL) and Bobby to get those boxes online. John is ultimately responsible for anything that happens but Bobby and David have a plan that he accepts.
- David will have an account with sudo permissions on the depot hardware; Bobby will start on this as soon as Michigan is done.
- New sites?
- First one below is solid, second has been offered, the rest...???
- UTK - Gerald R.
- LoC - if they want...
- AmericaView sites? Wisconsin? Alaska?
- Meeting of the executive committee?
- Further depot deployments should be driven by applications...
- OSG?
- SuraGrid
- deployment/upgrade experience from recent upgrades, plans for future
- management node, with an implementation of pexec for deploying software in parallel.
- run p commands to push out software updates - os update, depot updates.
- the operating system is on a USB key on the depots, which means the OS is on a separate device from the data drives, which makes M&O easy.
- need to write up a wiki page documenting how things are configured, so that there can be uniformity for others who might want to opt in. (???) Describe the main issue (that the OS drive should be separate from the data drives), other philosophies in use, etc.
- procedure for upgrading depot
- shutdown the depot code
- copy over the new code
- start back up
- can all be done in only a short time, with no loss of data.
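
The upgrade procedure above (shut down, copy, start back up), pushed out to many depots in parallel, could be sketched roughly as follows. This is only an illustration, not the actual pexec implementation on the management node: the service name `depot`, the use of `ssh`/`scp`/`service`, and all function names are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess


def upgrade_steps(host, src, dest):
    """Return the three commands for one depot: stop, copy, start.

    The service name "depot" is hypothetical; substitute whatever
    actually controls the depot code on the box.
    """
    return [
        ["ssh", host, "sudo", "service", "depot", "stop"],
        ["scp", "-r", src, f"{host}:{dest}"],
        ["ssh", host, "sudo", "service", "depot", "start"],
    ]


def run_command(cmd):
    """Run one command and return its exit code."""
    return subprocess.run(cmd).returncode


def upgrade_depot(host, src, dest, run=run_command):
    """Run the steps in order; stop at the first failure.

    On failure the depot is left as-is for an operator to inspect.
    """
    for cmd in upgrade_steps(host, src, dest):
        if run(cmd) != 0:
            return False
    return True


def upgrade_all(hosts, src, dest, run=run_command, workers=8):
    """Upgrade a list of depots in parallel; return host -> success flag."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda h: upgrade_depot(h, src, dest, run), hosts)
        return dict(zip(hosts, results))
```

Passing a different `run` function makes it possible to dry-run the procedure (for example, just printing the commands) before touching real depots.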
- monitoring, user info,...
- nagios will come back online once the UMich site is back online
- nagios is not required to run on each of the boxes, but can collect info from many nodes.
- nagios is designed to work across the wide area.
- helps catch problems early
- visualization - use part of what we had from SC07 on main reddnet page to allow users to get status of a particular site, see location of sites, show data movement.
- "Is the depot up and responding" health test à la OSG?
- want some sort of weathermap or OSG style site status
- if site is failing, it is taken "offline"
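
A very simple "is the depot up and responding" test, with failing sites marked offline, might look like the sketch below. It only checks that a TCP connection to the depot can be opened, which is weaker than a real protocol-level health test. The function names, the site table layout, and the host and port values are all assumptions for illustration; this is not how nagios or OSG implement their checks.

```python
import socket


def depot_responds(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def site_status(sites, probe=depot_responds):
    """Map each site name to "online" or "offline".

    sites: dict of site name -> (host, port); the layout is hypothetical.
    A site whose probe fails is reported as "offline".
    """
    return {
        name: ("online" if probe(host, port) else "offline")
        for name, (host, port) in sites.items()
    }
```

The resulting dictionary could feed a weathermap or status page; a production check would more likely send an actual depot request and require several consecutive failures before taking a site offline.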