Operations and Deployment: Difference between revisions

Revision as of 11:04, 28 January 2008

Deployment

  • Bring current deployment up-to-date (estimate complete by mid-Feb)
    • Build new image for 2GB internal USB memory
      • may be done by All-Hands
    • Design and implement a depot recovery process
      • This process will be vetted on current deployed hardware
      • Initial process may be done by All-Hands or soon thereafter
      • Begin with SFASU for initial vetting
    • Send recovery keys out to sites and update the depots
      • early February
      • need to recruit one person at each site to assist
    • Set Nagios back up
      • need a more stable system before this makes sense
      • turned off during this transition period to avoid flood of diagnostics
      • needs a day or two of time, could be done now depending on priority
  • Prepare additional existing hardware for deployment (19 nodes)
    • Update image on internal USB (will use for testing of the above recovery process)
    • Send 6 depots to SFASU with additional PDU
    • Find new collaborators/sites for remaining 13.
      • Ideas?
  • Develop MOU for current deployment (timescale uncertain?)
    • Longer term project
  • Define a standard set of software tools for depots
    • The following already exist
      • iperf
      • Nagios
      • mtr
    • other tools to be added?
      • investigating new tools now
  • Gain experience with existing deployment
    • See "Validation Framework" below.
    • make frequent reports in weekly REDDnet meetings
    • review deployment experience end of March
  • Proposed multi-tiered system for sites (for discussion)
    • Tier 1: Sites that run their own LServer and Chord ring
    • Tier 2: Sites that manage their own REDDnet depots
    • Tier 3: Sites that use their own storage resources as depots
    • Develop MOU for each tier
  • Investigate new management tools over the next two months
    • rsync or similar (short term)
    • Perceus (long term)
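
A rough sketch of the short-term rsync idea: build one rsync command per depot to mirror a tools directory from a management host. The depot hostnames and both directory paths are placeholders, not actual REDDnet values, and the commands default to --dry-run so nothing is changed until someone has reviewed the plan.

```python
import shlex

# Hypothetical depot hostnames; the real depot list is not given in these notes.
DEPOTS = ["depot01.example.org", "depot02.example.org"]


def build_rsync_command(source_dir, host, dest_dir, dry_run=True):
    """Return the rsync argument list that mirrors source_dir onto one depot."""
    # -a preserves permissions and times, -z compresses, --delete removes stale files.
    cmd = ["rsync", "-az", "--delete"]
    if dry_run:
        # --dry-run reports what would change without touching the depot.
        cmd.append("--dry-run")
    # A trailing slash on the source copies the directory contents, not the directory itself.
    cmd += [source_dir.rstrip("/") + "/", f"{host}:{dest_dir}"]
    return cmd


def plan_sync(source_dir, dest_dir, hosts=DEPOTS, dry_run=True):
    """Build one printable command line per depot."""
    return [
        shlex.join(build_rsync_command(source_dir, host, dest_dir, dry_run))
        for host in hosts
    ]


if __name__ == "__main__":
    # Both paths are illustrative assumptions.
    for line in plan_sync("/srv/depot-tools", "/opt/depot-tools"):
        print(line)
```

Printing the commands first, instead of running them, keeps the plan reviewable while the deployment is still in transition.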

Monitoring

  • Use StorCore, Nagios, iperf, and visualization tools from SC07
    • Have a statistics page that gathers information from tests and presents it cleanly
      • Long term project - 6 months to a year?
  • What is required to provide adequate support for REDDnet
    • Want feedback on this.
    • Setting expectations?
    • Actively talk to users over the next couple of months, out to 6 months, then ongoing
    • Develop an initial plan by mid-Feb, then review every 3 months and tweak
  • Create a REDDnet status site, using Google Maps
    • short term just get green dots on a map (what is the priority for this?)
    • longer term - expected to evolve, integrate with the vis site
  • Create an RT site to resolve users' issues
    • needs to happen quickly: target mid-Feb.
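
A possible starting point for the short-term "green dots on a map" status site: roll per-depot check results up to one colour per site and emit marker records that a map page could plot. The DepotCheck record, the site names, and the coordinates are all hypothetical. Where the check results come from (Nagios or otherwise) is left open, and this does not call any Nagios or Google Maps API.

```python
from dataclasses import dataclass


@dataclass
class DepotCheck:
    site: str   # site hosting the depot
    depot: str  # depot hostname
    ok: bool    # True if the most recent check passed


def site_colors(checks):
    """Map each site to 'green' if every depot check passed, else 'red'."""
    colors = {}
    for check in checks:
        if not check.ok:
            # Any failing depot turns the whole site red.
            colors[check.site] = "red"
        else:
            # Only mark green if the site has not already been marked.
            colors.setdefault(check.site, "green")
    return colors


def markers(checks, coordinates):
    """Build one marker dict per site that has known coordinates."""
    return [
        {"site": site, "lat": coordinates[site][0],
         "lon": coordinates[site][1], "color": color}
        for site, color in sorted(site_colors(checks).items())
        if site in coordinates
    ]


if __name__ == "__main__":
    # Placeholder sites and coordinates, for illustration only.
    checks = [
        DepotCheck("site-a", "depot01", True),
        DepotCheck("site-a", "depot02", False),
        DepotCheck("site-b", "depot03", True),
    ]
    coords = {"site-a": (10.0, 20.0), "site-b": (30.0, 40.0)}
    for marker in markers(checks, coords):
        print(marker)
```

The same marker records could later feed the vis site, which is one reason to keep the roll-up separate from the map rendering.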

Validation Framework

  • Stress and WAN testing on Production REDDnet
    • Automated testing with Clyde
      • exercise system prior to heavy real world use
      • unfortunately longer term - first of April?
      • this testing will move to test deployment eventually
    • Real world use (happening now, although not heavy)
  • QA testing on Test REDDnet required before moving into production REDDnet
    • A stringent set of tests to test the hardware, OS, IBP, and LStore as thoroughly as possible (primarily Clyde)
    • Allow users to test using this system
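
One way the "QA before production" rule could be written down as a check: a depot moves from Test REDDnet to production only if every layer named above has a passing result. The layer keys and the shape of the results dict are assumptions made for illustration. This is not Clyde's actual output format.

```python
# Layers taken from the QA bullet above; the key names are an assumption.
REQUIRED_LAYERS = ("hardware", "os", "ibp", "lstore")


def missing_or_failed(results):
    """Return the required layers that failed or were never tested.

    results maps a layer name to True (passed) or False (failed).
    """
    return [layer for layer in REQUIRED_LAYERS if not results.get(layer, False)]


def ready_for_production(results):
    """A depot qualifies only when no required layer is missing or failed."""
    return not missing_or_failed(results)


if __name__ == "__main__":
    sample = {"hardware": True, "os": True, "ibp": False}
    print(ready_for_production(sample), missing_or_failed(sample))
```

Treating an untested layer the same as a failed one keeps a depot from slipping into production on a partial test run.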