Operations and Deployment

Deployment

  • Bring current deployment up to date (estimated completion: mid-Feb)
    • Build new image for 2GB internal USB memory
      • may be done by the All-Hands meeting
    • Design and implement a depot recovery process
      • This process will be vetted on current deployed hardware
      • Initial process may be done by the All-Hands meeting or soon thereafter
      • Begin with SFASU for initial vetting
    • Send recovery keys out to sites and update the depots
      • early February
      • need to recruit one person at each site to assist
    • Set Nagios back up
      • need a more stable system before this makes sense
      • turned off during this transition period to avoid flood of diagnostics
      • needs a day or two of time, could be done now depending on priority
  • Prepare additional existing hardware for deployment (19 nodes)
    • Update image on internal USB (will be used to test the recovery process above)
    • Send 6 depots to SFASU with additional PDU
    • Find new collaborators/sites for remaining 13.
      • Ideas?
  • Develop MOU for current deployment (timescale uncertain?)
    • Longer term project
  • Define a standard set of software tools for depots
    • The following also exist
      • iperf
      • Nagios
      • mtr
    • other tools to be added?
      • investigating new tools now
  • Gain experience with existing deployment
    • See "Validation Framework" below.
    • make frequent reports in weekly REDDnet meetings
    • review deployment experience end of March
  • Proposed multi-tiered system for sites for discussion
    • Tier 1: Sites that run their own LServer and Chord ring
    • Tier 2: Sites that manage their own REDDnet depots
    • Tier 3: Sites that use their own storage resources as depots
    • Tier ?: Sites that supply rack space and basic infrastructure, with management by ACCRE
    • Develop MOU for each tier - we need to be able to supply reliable storage.
  • Needs
    • What resources
    • What policies
    • What management
  • Investigate new management tools over the next two months
    • rsync or similar (short term)
    • Perceus (long term)

Monitoring

  • Use StorCore, Nagios, iperf, and visualization tools from SC07
    • Have a statistics page that gathers information from tests and presents it cleanly
      • Long term project - 6 months to a year?
  • What is required to provide adequate support for REDDnet
    • Want feedback on this.
    • Setting expectations?
    • Actively talk to users: over the next couple of months, out to 6 months, and ongoing
    • Develop an initial plan by mid-Feb, then review every 3 months and tweak
  • Create a REDDnet status site using Google Maps
    • short term just get green dots on a map (what is the priority for this?)
    • longer term - expected to evolve, integrate with the vis site
  • Create an RT site to resolve users' issues
    • needs to happen quickly (mid-Feb)
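
The short-term "green dots on a map" goal could be sketched as a small reduction from per-depot check results to a marker color. This is only an illustration: the depot names, check names, and the helpers `marker_color` and `status_map` are hypothetical, and the rule that any failed check turns a depot red is an assumption, not a decided policy.

```python
from typing import Dict

# Hypothetical check results for one depot: check name -> passed?
CheckResults = Dict[str, bool]


def marker_color(checks: CheckResults) -> str:
    """Return the map marker color for one depot.

    Assumed rule: no data -> gray, all checks pass -> green, otherwise red.
    """
    if not checks:
        return "gray"
    return "green" if all(checks.values()) else "red"


def status_map(depots: Dict[str, CheckResults]) -> Dict[str, str]:
    """Map each depot name to its marker color."""
    return {name: marker_color(checks) for name, checks in depots.items()}


if __name__ == "__main__":
    # Made-up depots and checks, for illustration only.
    example = {
        "depot-a": {"ping": True, "ibp": True},
        "depot-b": {"ping": True, "ibp": False},
        "depot-c": {},
    }
    for name, color in status_map(example).items():
        print(name, color)
```

The resulting name-to-color mapping is the kind of data a map front end would need; how the checks are collected (Nagios, iperf, or otherwise) is left open here.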

Validation Framework

  • Stress and WAN testing on Production REDDnet
    • Automated testing with Clyde
      • exercise system prior to heavy real world use
      • unfortunately longer term - first of April?
      • this testing will move to test deployment eventually
    • Real world use (happening now, although not heavy)
  • QA testing on Test REDDnet required before moving into production REDDnet
    • A stringent set of tests to exercise the hardware, OS, IBP, and LStore as thoroughly as possible (primarily Clyde)
    • Allow users to test using this system
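
The requirement that QA testing on Test REDDnet precede a move into production could be expressed as a simple gate over test results. The sketch below is an assumption-laden illustration: treating hardware, OS, IBP, and LStore as one suite each, the suite names, and the functions `ready_for_production` and `missing_or_failed` are all hypothetical and do not describe how Clyde reports results.

```python
from typing import Dict, Iterable, List

# Components named in the QA item above; one suite per component is an assumption.
REQUIRED_SUITES = ("hardware", "os", "ibp", "lstore")


def missing_or_failed(results: Dict[str, bool],
                      required: Iterable[str] = REQUIRED_SUITES) -> List[str]:
    """List required suites that failed or did not report at all."""
    return [suite for suite in required if not results.get(suite, False)]


def ready_for_production(results: Dict[str, bool],
                         required: Iterable[str] = REQUIRED_SUITES) -> bool:
    """True only if every required suite reported a pass.

    A suite missing from `results` counts as a failure.
    """
    return not missing_or_failed(results, required)


if __name__ == "__main__":
    # Made-up results: LStore has not reported, so the gate stays closed.
    results = {"hardware": True, "os": True, "ibp": True}
    print(ready_for_production(results))
    print(missing_or_failed(results))
```

Counting a missing suite as a failure is a deliberately conservative choice, so that an untested component cannot slip into production by omission.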