Notice

This document is for a development version of Ceph.

CRUSH MSR (Multi-step Retry)

Motivation

Conventional CRUSH has an important limitation: rules with multiple choose steps which hit an out osd cannot retry prior steps. As an example, with a rule like

rule replicated_rule_1 {
    ...
    step take default class hdd
    step chooseleaf firstn 3 type host
    step emit
}

one might expect that if all of the OSDs on a particular host are marked out, mappings including those OSDs would end up on another host (provided that there are enough hosts). Indeed, that’s what will happen. Moreover, if 1/8 OSDs on a host are marked out, roughly 1/8 of the PGs mapped to that host will end up remapped to some other host keeping overall per-OSD utilization even.

Suppose, instead, the rule were written like this:

rule replicated_rule_1 {
    ...
    step take default class hdd
    step choose firstn 3 type host
    step choose firstn 1 type osd
    step emit
}

The behavior would be very similar as long as no OSDs are marked out. However, if an OSD is marked out, any PGs mapped to that OSD will be remapped to other OSDs on the same host resulting in those OSDs being over-utilized relative to OSDs on other hosts. Moreover, if all of the OSDs on a host are marked out, mappings that happen to hit that host will fail resulting in undersized PGs.

As long as the goal is to split N OSDs between N failure domains, the solution is simply to use the chooseleaf variant above. However, consider a use case where we want to split an 8+6 EC encoding over 4 hosts in order to tolerate the loss of a host and an OSD on another host with 1.75x storage overhead. The rule would have to look something like:

rule ecpool-86 {
    ...
    step take default class hdd
    step choose indep 4 type host
    step choose indep 4 type osd
    step emit
}

This does split up to 16 OSDs between 4 hosts (with an 8+6 code, it would put 4 OSDs on each of the first 3 and 2 on the last) and meets our failure requirements. However, for the reasons outlined above, it will behave poorly as OSDs are marked out if there are other hosts to rebalance to. chooseleaf is not a solution here because it does not support mapping more than one leaf below the specified type.

MSR

CRUSH MSR (Multi-step Retry) rules solve the above problem by using a different descent algorithm which retries all of the steps upon hitting an out OSD. Where classic CRUSH is breadth first (for each step, it fully populates the vector before proceeding to the next step), MSR rules are depth first -- for each choice, we recursively descend through all of the steps before continuing with the next choice. The above use case can be satisfied with the following rule:

rule ecpool-86 {
    type msr_indep
    ...
    step take default class hdd
    step choosemsr 4 type host
    step choosemsr 4 type osd
    step emit
}

As with the chooseleaf example at the top, as OSDs are marked out, those OSDs are be remapped proportionately to other hosts so long as there are extras available. For details on how that works while still preserving failure domain isolation, see the comments in mapper.c:crush_msr_choose.

Rule Structure

CRUSH MSR rules are crush rules with type CRUSH_RULE_TYPE_MSR_FIRSTN or CRUSH_RULE_TYPE_MSR_INDEP (see mapper.c: rule_type_is_msr). Unlike with classic crush rules, individual steps do not specify firstn or indep. The output order is instead defined by the rule type for the whole rule.

MSR rules have some structural differences from conventional rules:

  • The rule type determines whether the mapping is FIRSTN or INDEP. Because the descent can retry steps, it doesn’t really make sense for steps to individually specify output order and I’m not really aware of any use cases that would benefit from it.

  • MSR rules must be structured as a (possibly empty) prefix of config steps (CRUSH_RULE_SET_CHOOSE_MSR*) followed by a sequence of EMIT blocks each comprised of a TAKE step, a sequence of CHOOSE_MSR steps, and ended by an EMIT step.

  • MSR steps must be choosemsr. choose and chooseleaf are not permitted.

Working Space

MSR rules also have different requirements for working space. Conventional CRUSH requires 3 vectors of size result_max to use for working space -- two to alternate as it processes each rule and one, additionally, for chooseleaf. MSR rules need N vectors where N is the number of choosemsr steps in the longest EMIT block since it needs to retain all of the choices made as part of each descent.

See mapper.h/c:crush_work_size, crush_msr_scan_rule for details.

Implementation

mapper.h/c:crush_do_rule internally branches to mapper.c:crush_msr_do_rule for rules of type CRUSH_RULE_TYPE_MSR_* (see mapper.c:rule_type_is_msr).

MSR related functions in mapper.c are annotated with more details about the algorithm.

Brought to you by the Ceph Foundation

The Ceph Documentation is a community resource funded and hosted by the non-profit Ceph Foundation. If you would like to support this and our other efforts, please consider joining now.