Operations and Deployment

From ReddNet
Jump to navigation Jump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

Deployment

  • Bring current deployment up-to-date (estimate complete by mid-Feb)
    • Build new image for 2GB internal USB memory
      • may be done by All-Hands
    • Design and implement a depot recovery process
      • This process will be vetted on current deployed hardware
      • Initial process may be done by All Hands or soon thereafter
      • Begin with SFASU for initial vetting
    • Send recovery keys out to sites and update the depots
      • early February
      • need to recruit one person at each site to assist
    • Set Nagios back up
      • need a more stable system before this makes sense
      • turned off during this transition period to avoid flood of diagnostics
      • needs a day or two of time, could be done now depending on priority
  • Prepare additional existing hardware for deployment (19 nodes)
    • Update image on internal USB (will use for testing of the above recovery process)
    • Send 6 depots to SFASU with additional PDU
    • Find new collaborators/sites for remaining 13.
      • Ideas?
  • Develop MOU for current deployment (timescale uncertain?)
    • Longer term project
  • Define a standard set of software tools for depots
    • The following also exist
      • Iperf
      • Nagios
      • mtr
    • other tools to be added?
      • investigating new tools now
  • Gain experience with existing deployment
    • See "Validation Framework" below.
    • make frequent reports in weekly REDDnet meetings
    • review deployment experience end of March
  • Proposed multi-tiered system for sites for discussion
    • Tier 1: Sites that run their own LServer and Chord ring
    • Tier 2: Sites that manage their own REDDnet depots
    • Tier 3: Sites that use their own storage resources as depots
    • Tier ?: Sites that supply rack space, and basic infrastructure, but management is by ACCRE
    • Develop MOU for each tier - we need to be able to supply reliable storage.
  • Needs
    • What resources
    • What policies
    • What management
  • Investigate new management tools over the next two months
    • rsync or similar (short term)
    • Perceus (long term)

Monitoring

  • Use StorCore, Nagios, iperf, and visualization tools from SC07
    • Have a statistic page that gathers information from tests and presents them cleanly
      • Long term project - 6 months to a year?
  • What is required to provide adequate support for REDDnet
    • Want feedback on this.
    • Setting expectations?
    • Actively talk to users over the next couple of months - 6 months - ongoing
    • Develop an initial plan by mid-Feb, then review every 3 months and tweak
  • Create a REDDnet status site, using google maps
    • short term just get green dots on a map (what is the priority for this?)
    • longer term - expected to evolve, integrate with the vis site
  • Create an RT site to resolve users' issues
    • needs to happen quickly. mid-Feb.

Validation Framework

  • Stress and WAN testing on Production REDDnet
    • Automated testing with Clyde
      • excercise system prior to heavy real world use
      • unfortunately longer term - first of April?
      • this testing will move to test deployment eventually
    • Real world use (happening now, although not heavy)
  • QA testing on Test REDDnet required before moving into production REDDnet
    • A stringent set of tests to test both the hardware, OS, IBP, and LStore as throughly as possible (primarily Clyde)
    • Allow users to test using this system