Operations and Deployment: Difference between revisions

Latest revision as of 11:04, 1 February 2008

Deployment

Bring current deployment up-to-date (estimate complete by mid-Feb)
- Build new image for 2GB internal USB memory
  - may be done by All Hands
- Design and implement a depot recovery process
  - This process will be vetted on current deployed hardware
  - Initial process may be done by All Hands or soon thereafter
  - Begin with SFASU for initial vetting
- Send recovery keys out to sites and update the depots
  - early February
  - need to recruit one person at each site to assist
- Set Nagios back up
  - need a more stable system before this makes sense
  - turned off during this transition period to avoid flood of diagnostics
  - needs a day or two of time, could be done now depending on priority

Prepare additional existing hardware for deployment (19 nodes)
- Update image on internal USB (will use for testing of the above recovery process)
- Send 6 depots to SFASU with additional PDU
- Find new collaborators/sites for remaining 13.
  - Ideas?

Develop MOU for current deployment (timescale uncertain?)
- Longer term project

Define a standard set of software tools for depots
- The following already exist
  - Iperf
  - Nagios
  - mtr
- other tools to be added?
  - investigating new tools now
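
A minimal sketch of how a depot could be checked against the tool list above. The executable names are assumptions (iperf, for example, may be installed as `iperf` or `iperf3`, and the Nagios binary name varies by package), and `missing_tools` is a hypothetical helper, not part of any existing REDDnet tooling.

```python
import shutil

# Tool names come from the list above; the executable names are assumptions
# and may differ between distributions and package versions.
REQUIRED_TOOLS = ["iperf", "nagios", "mtr"]


def missing_tools(tools, which=shutil.which):
    """Return the subset of `tools` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without the
    tools actually being installed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All standard tools found on PATH")
```

Run on each depot, a check like this could feed the deployment reports, though which tools end up in the standard set is still open.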

Gain experience with existing deployment
- See "Validation Framework" below.
- make frequent reports in weekly REDDnet meetings
- review deployment experience end of March

Proposed multi-tiered system for sites for discussion
- Tier 1: Sites that run their own LServer and Chord ring
- Tier 2: Sites that manage their own REDDnet depots
- Tier 3: Sites that use their own storage resources as depots
- Tier ?: Sites that supply rack space and basic infrastructure but are managed by ACCRE
- Develop MOU for each tier - we need to be able to supply reliable storage.

Needs
- What resources
- What policies
- What management

Investigate new management tools over the next two months
- rsync or similar (short term)
- Perceus (long term)
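
As a sketch of the short-term rsync option, the helper below builds an rsync command line for pushing a directory to one depot. The function name, host name, and paths are hypothetical; the flags used (`-a`, `-z`, `--delete`, `--dry-run`) are standard rsync options. Dry-run is the default here because `--delete` removes files on the destination that are absent from the source.

```python
import shlex


def rsync_command(source_dir, host, dest_dir, dry_run=True):
    """Build an rsync argument list that mirrors source_dir to host:dest_dir.

    A trailing slash is forced on the source so that rsync copies the
    directory's contents rather than the directory itself.
    """
    cmd = ["rsync", "-az", "--delete"]
    if dry_run:
        # Report what would change without modifying the destination.
        cmd.append("--dry-run")
    cmd.append(source_dir.rstrip("/") + "/")
    cmd.append(f"{host}:{dest_dir}")
    return cmd


if __name__ == "__main__":
    # Hypothetical host and paths, for illustration only.
    cmd = rsync_command("/srv/depot-config", "depot1.example.org", "/etc/depot")
    print(shlex.join(cmd))
```

The command is only printed, not executed; passing the list to `subprocess.run` would be the next step once the paths and hosts are agreed.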

Monitoring

Use StorCore, Nagios, iperf, and visualization tools from SC07
- Have a statistics page that gathers information from tests and presents it cleanly
  - Long term project - 6 months to a year?

What is required to provide adequate support for REDDnet
- Want feedback on this.
- Setting expectations?
- Actively talk to users over the next couple of months - 6 months - ongoing
- Develop an initial plan by mid-Feb, then review every 3 months and tweak

Create a REDDnet status site using Google Maps
- short term: just get green dots on a map (what is the priority for this?)
- longer term - expected to evolve, integrate with the vis site
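
One possible shape for the short-term "green dots" data, sketched under assumptions: each depot record carries a name, coordinates, and a reachability flag, and unreachable depots are shown in red (the color for a down depot is not specified above). The field names and the sample records are hypothetical placeholders, not real site data.

```python
import json


def marker_color(reachable):
    """Green for a reachable depot; red for an unreachable one (assumed)."""
    return "green" if reachable else "red"


def build_markers(depots):
    """Convert depot status records into marker dictionaries for a map page."""
    return [
        {
            "title": depot["name"],
            "lat": depot["lat"],
            "lon": depot["lon"],
            "color": marker_color(depot["reachable"]),
        }
        for depot in depots
    ]


if __name__ == "__main__":
    # Placeholder records; real names and coordinates would come from
    # the monitoring system.
    depots = [
        {"name": "site-a", "lat": 0.0, "lon": 0.0, "reachable": True},
        {"name": "site-b", "lat": 0.0, "lon": 0.0, "reachable": False},
    ]
    print(json.dumps(build_markers(depots), indent=2))
```

Emitting plain JSON keeps the status data independent of the map front end, which may help when the page is later integrated with the vis site.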

Create an RT site to resolve users' issues
- needs to happen quickly, by mid-Feb.

Validation Framework

Stress and WAN testing on Production REDDnet
- Automated testing with Clyde
  - exercise system prior to heavy real world use
  - unfortunately longer term - first of April?
  - this testing will move to test deployment eventually
- Real world use (happening now, although not heavy)

QA testing on Test REDDnet required before moving into production REDDnet
- A stringent set of tests that exercises the hardware, OS, IBP, and LStore as thoroughly as possible (primarily Clyde)
- Allow users to test using this system

Retrieved from "http://www.reddnet.org/mwiki/index.php?title=Operations_and_Deployment&oldid=3244"

@@ Line 1: / Line 1: @@
-= Deployment =
+=== Deployment ===
-* Bring current deployment up-to-date
+* '''Bring current deployment up-to-date (estimate complete by mid-Feb)'''
-* Prepare existing hardware for deployment
+** Build new image for 2GB internal USB memory
-* Develop MOU for current deployment
+*** may be done by All-Hands
-* Design and implement a depot recovery process
+** Design and implement a depot recovery process
+*** This process will be vetted on current deployed hardware
+*** Initial process may be done by All Hands or soon thereafter
+*** Begin with SFASU for initial vetting
+** Send recovery keys out to sites and update the depots
+*** early February
+*** need to recruit one person at each site to assist
+** Set Nagios back up
+*** need a more stable system before this makes sense
+*** turned off during this transition period to avoid flood of diagnostics
+*** needs a day or two of time, could be done now depending on priority
+* '''Prepare additional existing hardware for deployment (19 nodes)'''
+** Update image on internal USB (will use for testing of the above recovery process)
+** Send 6 depots to SFASU with additional PDU
+** Find new collaborators/sites for remaining 13.
+*** Ideas?
+* Develop MOU for current deployment (timescale uncertain?)
+** Longer term project
 * Define a standard set of software tools for depots
-* Gain experience with existing deployment
+** The following also exist
-* Discuss a multi-tiered system for sites
+*** Iperf
-* Find new collaborators/sites
+*** Nagios
-* Investigate Perceus as an update tool
+*** mtr
+** other tools to be added?
+*** investigating new tools now
+* Gain experience with existing deployment
+** See "Validation Framework" below.
+** make frequent reports in weekly REDDnet meetings
+** review deployment experience end of March
+* '''Proposed multi-tiered system for sites for discussion'''
+** Tier 1: Sites that run their own LServer and Chord ring
+** Tier 2: Sites that manage their own REDDnet depots
+** Tier 3: Sites that use their own storage resources as depots
+** Tier ?: Sites that supply rack space, and basic infrastructure, but management is by ACCRE
+** Develop MOU for each tier - we need to be able to supply reliable storage.
+* Needs
+** What resources
+** What policies
+** What management
+* Investigate new management tools over the next two months
+** rsync or similar (short term)
+** Perceus (long term)
+=== Monitoring ===
+* '''Use StorCore, Nagios, iperf, and visualization tools from SC07'''
+** Have a statistic page that gathers information from tests and presents them cleanly
+*** Long term project - 6 months to a year?
+* '''What is required to provide adequate support for REDDnet'''
+** Want feedback on this.
+** Setting expectations?
+** Actively talk to users over the next couple of months - 6 months - ongoing
+** Develop an initial plan by mid-Feb, then review every 3 months and tweak
-= Monitoring =
-* Use StorCore, Nagios, iperf, and visualization tools from SC07
 * Create a REDDnet status site, using google maps
-* Create an RT site to resolve users' issues
+** short term just get green dots on a map (what is the priority for this?)
+** longer term - expected to evolve, integrate with the vis site
+* '''Create an RT site to resolve users' issues'''
+** needs to happen quickly.  mid-Feb.
+=== Validation Framework ===
+* '''Stress and WAN testing on Production REDDnet'''
+** Automated testing with Clyde
+*** exercise system prior to heavy real world use
+*** unfortunately longer term - first of April?
+*** this testing will move to test deployment eventually
+** Real world use (happening now, although not heavy)
-= Validation Framework =
-* Stress and WAN testing on Production REDDnet
 * QA testing on Test REDDnet required before moving into production REDDnet
+** A stringent set of tests to test the hardware, OS, IBP, and LStore as thoroughly as possible (primarily Clyde)
+** Allow users to test using this system
