
Mesos Internals and Code Analysis (2): Starting the Mesos Master

The Mesos master starts up in src/master/main.cpp.

1. master::Flags flags: parse command-line arguments and environment variables

https://mesosphere.com/blog/2015/05/14/using-stout-to-parse-command-line-options/
The flags facility from Mesos's stout library is used to parse command-line options and environment variables (environment variables prefixed with MESOS_ map onto the corresponding flags).
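A minimal, self-contained sketch of how a stout flags struct is defined and loaded. The struct MyFlags and the flag "port" are purely illustrative; the real master::Flags is defined in src/master/flags.hpp, and the exact return type of load() differs between stout versions:

#include <iostream>

#include <stout/flags.hpp>
#include <stout/try.hpp>

struct MyFlags : public flags::FlagsBase
{
  MyFlags()
  {
    // Register a flag: member pointer, name, help text, default value.
    add(&MyFlags::port, "port", "Port to listen on", 5050);
  }

  int port;
};

int main(int argc, char** argv)
{
  MyFlags flags;

  // Values come either from --port=... on the command line or from the
  // MESOS_PORT environment variable (the "MESOS_" prefix is stripped).
  Try<flags::Warnings> load = flags.load("MESOS_", argc, argv);

  if (load.isError()) {
    std::cerr << load.error() << std::endl;
    return 1;
  }

  std::cout << "port = " << flags.port << std::endl;
  return 0;
}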


2. process::firewall::install(move(rules)): if the --firewall_rules flag is present, the corresponding firewall rules are installed
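A paraphrased sketch of the corresponding logic in src/master/main.cpp (details differ slightly between Mesos versions): when --firewall_rules lists disabled endpoints, a DisabledEndpointsFirewallRule is built and installed into libprocess, which then rejects HTTP requests to those endpoints.

if (flags.firewall_rules.isSome()) {
  vector<Owned<process::firewall::FirewallRule>> rules;

  const Firewall firewall = flags.firewall_rules.get();

  if (firewall.has_disabled_endpoints()) {
    hashset<string> endpoints;
    foreach (const string& endpoint, firewall.disabled_endpoints().paths()) {
      endpoints.insert(endpoint);
    }

    rules.emplace_back(
        new process::firewall::DisabledEndpointsFirewallRule(endpoints));
  }

  process::firewall::install(std::move(rules));
}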







3. ModuleManager::load(flags.modules.get()): if --modules or --modules_dir=dirpath is given, the .so files referenced there are loaded




Configuring Mesos Modules




What kinds of modules can be written


Allocator

The default is the built-in Hierarchical Dominant Resource Fairness (DRF) allocator.

To write your own allocator:

  • load its .so via --modules
  • select it with the --allocator flag
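For example (both the module name ExternalAllocatorModule and the JSON config path here are placeholders):

./bin/mesos-master.sh --modules="file://<path-to-modules-config>.json" --allocator=ExternalAllocatorModule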

Hook

The available hooks API is defined in mesos/hook.hpp, which for each hook defines its insertion point and the available context.

To load a hook into Mesos, you need to:

  • introduce it to Mesos by listing it in the --modules configuration,
  • select it as a hook module via the --hooks flag.

./bin/mesos-agent.sh --master=<IP>:<PORT> --modules="file://<path-to-modules-config>.json" --hooks=TestTaskHook
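A hook module is a shared library that subclasses mesos::Hook and exports a module declaration. Below is a simplified sketch, loosely modeled on src/examples/test_hook_module.cpp; names and details are illustrative, not an exact copy of that file:

#include <mesos/hook.hpp>
#include <mesos/mesos.hpp>
#include <mesos/module/hook.hpp>

class TestTaskHook : public mesos::Hook
{
public:
  // Override only the hook points you need; the base class provides
  // no-op defaults for the rest.
};

// The function the module manager calls to instantiate the hook.
static mesos::Hook* createHook(const mesos::Parameters& parameters)
{
  return new TestTaskHook();
}

// The module declaration; its name is what --modules and --hooks refer to.
mesos::modules::Module<mesos::Hook> org_apache_mesos_TestTaskHook(
    MESOS_MODULE_API_VERSION,
    MESOS_VERSION,
    "Apache Mesos",
    "modules@mesos.apache.org",
    "Test task hook module.",
    nullptr,      // compatibility check; nullptr means always compatible
    createHook);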


class Hook




Isolator


Isolator modules enable experimenting with specialized isolation and monitoring capabilities. Examples of these could be 3rdparty resource isolation mechanisms for GPU hardware, networking, etc.


--isolation=VALUE

Isolation mechanisms to use, e.g., posix/cpu,posix/mem, or cgroups/cpu,cgroups/mem, or network/port_mapping (configure with flag: --with-network-isolator to enable), or `gpu/nvidia` for Nvidia-specific GPU isolation, or external, or load an alternate isolator module using the --modules flag. Note that this flag is only relevant for the Mesos Containerizer. (default: posix/cpu,posix/mem)


All the enabled isolators are registered into a vector:

 // Create the isolators for the MesosContainerizer.
  const hashmap<string, lambda::function<Try<Isolator*>(const Flags&)>>
    creators = {
    // Filesystem isolators.
    {"filesystem/posix", &PosixFilesystemIsolatorProcess::create},
#ifdef __linux__
    {"filesystem/linux", &LinuxFilesystemIsolatorProcess::create},
    // TODO(jieyu): Deprecate this in favor of using filesystem/linux. 
    {"filesystem/shared", &SharedFilesystemIsolatorProcess::create},
#endif
    // Runtime isolators.
    {"posix/cpu", &PosixCpuIsolatorProcess::create},
    {"posix/mem", &PosixMemIsolatorProcess::create},
    {"posix/disk", &PosixDiskIsolatorProcess::create},
#ifdef __linux__
    {"cgroups/cpu", &CgroupsCpushareIsolatorProcess::create},
    {"cgroups/mem", &CgroupsMemIsolatorProcess::create},
    {"cgroups/net_cls", &CgroupsNetClsIsolatorProcess::create},
    {"cgroups/perf_event", &CgroupsPerfEventIsolatorProcess::create},
    {"docker/runtime", &DockerRuntimeIsolatorProcess::create},
    {"namespaces/pid", &NamespacesPidIsolatorProcess::create},
#endif
#ifdef WITH_NETWORK_ISOLATOR
    {"network/port_mapping", &PortMappingIsolatorProcess::create},
#endif
  };
  vector<Owned<Isolator>> isolators;


When a Mesos container is launched, in addition to forking the child process, isolate() is called:


Future<bool> MesosContainerizerProcess::__launch(
    const ContainerID& containerId,
    const ExecutorInfo& executorInfo,
    const string& directory,
    const Option<string>& user,
    const SlaveID& slaveId,
    const PID<Slave>& slavePid,
    bool checkpoint,
    const list<Option<ContainerLaunchInfo>>& launchInfos)
{
……
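  // Fork the mesos-containerizer helper that will eventually exec the
  // executor; stdout/stderr go to the agent's own descriptors when running
  // locally, otherwise to the IO prepared by the container logger.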
  Try<pid_t> forked = launcher->fork(
      containerId,
      path::join(flags.launcher_dir, MESOS_CONTAINERIZER),
      argv,
      Subprocess::FD(STDIN_FILENO),
      (local ? Subprocess::FD(STDOUT_FILENO) : Subprocess::IO(subprocessInfo.out)),
      (local ? Subprocess::FD(STDERR_FILENO) : Subprocess::IO(subprocessInfo.err)),
      launchFlags,
      environment,
      None(),
      namespaces);
……
return isolate(containerId, pid)
      .then(defer(self(), &Self::fetch, containerId, executorInfo.command(), directory, user, slaveId))
      .then(defer(self(), &Self::exec, containerId, pipes[1]))
      .onAny(lambda::bind(&os::close, pipes[0]))
      .onAny(lambda::bind(&os::close, pipes[1]));


Future<bool> MesosContainerizerProcess::isolate(
    const ContainerID& containerId,
    pid_t _pid)
{
  containers_[containerId]->state = ISOLATING;

  // Set up callbacks for isolator limitations.
  foreach (const Owned<Isolator>& isolator, isolators) {
    isolator->watch(containerId)
      .onAny(defer(self(), &Self::limited, containerId, lambda::_1));
  }

  // Isolate the executor with each isolator.
  // NOTE: This is done in parallel and is not sequenced like prepare
  // or destroy because we assume there are no dependencies in
  // isolation.
  list<Future<Nothing>> futures;
  foreach (const Owned<Isolator>& isolator, isolators) {
    futures.push_back(isolator->isolate(containerId, _pid));
  }

  // Wait for all isolators to complete.
  Future<list<Nothing>> future = collect(futures);
  containers_[containerId]->isolation = future;
  return future.then([]() { return true; });
}




Master Contender and Detector


Contender and Detector modules enable developers to implement custom leader election and master detection mechanisms, other than relying on Zookeeper by default.


To load custom contender and detector modules, you need to:
  • supply --modules when running the Mesos master,
  • specify the selected contender and detector modules with the --master_contender and --master_detector flags on the Mesos master, and --master_detector on the Mesos slave.


4. HookManager::initialize(flags.hooks.get())

If the --hooks flag is present, the listed hooks are loaded.




5. Create an instance of the allocator.




The default allocator in the Mesos source, HierarchicalDRFAllocator, lives in $MESOS_HOME/src/master/allocator/mesos/hierarchical.hpp, and the Sorter that DRF uses to order frameworks lives in $MESOS_HOME/src/master/allocator/sorter/drf/sorter.cpp; reading those sources shows how it works.


HierarchicalDRF


How offers are allocated is decided by the resource allocation module, the Allocator, which lives inside the master. The allocator determines the order in which frameworks are offered resources, while making sure resources are shared fairly and utilization is maximized.


Because Mesos schedules resources across a data center and the resource demands are heterogeneous, allocation is harder than ordinary scheduling, so Mesos adopts DRF (Dominant Resource Fairness).

A framework's dominant share is the largest fraction it holds of any single resource type. DRF computes the dominant share of every registered framework to ensure each framework receives a fair share of its dominant resource.


Consider a system with 9 CPUs and 18 GB of RAM and two users: user A's tasks each require {1 CPU, 4 GB} and user B's tasks each require {3 CPU, 1 GB}; each user runs as many tasks as possible to use the system's resources.

Each of A's tasks consumes 1/9 of the total CPU and 2/9 of the total memory, so A's dominant resource is memory; each of B's tasks consumes 1/3 of the total CPU and 1/18 of the total memory, so B's dominant resource is CPU. DRF equalizes the users' dominant shares by running 3 of A's tasks and 2 of B's tasks. A's three tasks consume {3 CPU, 12 GB} in total and B's two tasks consume {6 CPU, 2 GB}; with this allocation the two dominant shares are equal: user A gets 2/3 of the RAM while user B gets 2/3 of the CPU.

This allocation can be computed as follows. Let x and y be the number of tasks allocated to user A and user B respectively. Then user A consumes {x CPU, 4x GB} and user B consumes {3y CPU, y GB}; user A's dominant share is 4x/18 and user B's is 3y/9. Equalizing the dominant shares while maximizing the allocation gives the optimization problem:

max(x, y)           (maximize allocations)
subject to
  x + 3y <= 9       (CPU constraint)
  4x + y <= 18      (memory constraint)
  2x/9 = y/3        (equalize dominant shares)

Solving this gives x = 3 and y = 2, so user A gets {3 CPU, 12 GB} and user B gets {6 CPU, 2 GB}.
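The same numbers can be reproduced with a small standalone simulation of the DRF rule "always give the next task to the user with the lowest dominant share". This is a toy illustration, not Mesos code:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct User
{
  std::string name;
  double taskCpu, taskMem;  // per-task demand
  double cpu, mem;          // allocated so far
  int tasks;

  double dominantShare(double totalCpu, double totalMem) const
  {
    return std::max(cpu / totalCpu, mem / totalMem);
  }
};

int main()
{
  const double totalCpu = 9, totalMem = 18;
  double freeCpu = totalCpu, freeMem = totalMem;

  std::vector<User> users = {
      {"A", 1, 4, 0, 0, 0},   // user A: each task needs <1 CPU, 4 GB>
      {"B", 3, 1, 0, 0, 0},   // user B: each task needs <3 CPU, 1 GB>
  };

  while (true) {
    // Pick the user with the lowest dominant share.
    auto next = std::min_element(
        users.begin(), users.end(),
        [&](const User& a, const User& b) {
          return a.dominantShare(totalCpu, totalMem) <
                 b.dominantShare(totalCpu, totalMem);
        });

    // For simplicity, stop as soon as that user's task no longer fits.
    if (next->taskCpu > freeCpu || next->taskMem > freeMem) {
      break;
    }

    freeCpu -= next->taskCpu;
    freeMem -= next->taskMem;
    next->cpu += next->taskCpu;
    next->mem += next->taskMem;
    next->tasks++;
  }

  // Prints: A gets 3 tasks {3 CPU, 12 GB}, B gets 2 tasks {6 CPU, 2 GB}.
  for (const User& u : users) {
    std::cout << u.name << ": " << u.tasks << " tasks, "
              << u.cpu << " CPU, " << u.mem << " GB" << std::endl;
  }

  return 0;
}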


The core HierarchicalDRF algorithm


It is implemented in src/master/allocator/mesos/hierarchical.cpp.




Rather than handing all resources to every framework on every allocation cycle, the allocation algorithm decides how much each framework gets.





Overall, allocation happens in two stages:
  • first, satisfy the roles that have a quota;
  • then, allocate the remaining resources to roles without a quota.

Within each stage, "hierarchical" means two levels of sorting:
  • the first level sorts the roles;
  • the second level sorts the frameworks within the same role.

At each level the ordering is by the computed share, which decides who is offered resources first.
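A toy illustration (not the Mesos code) of this two-level ordering, where roles and then the frameworks within each role are sorted by their current share; all names and share values are made up:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Framework
{
  std::string id;
  double share;  // the framework's share within its role
};

struct Role
{
  std::string name;
  double share;  // the role's (weighted) dominant share
  std::vector<Framework> frameworks;
};

int main()
{
  std::vector<Role> roles = {
      {"analytics", 0.50, {{"spark", 0.40}, {"presto", 0.10}}},
      {"web", 0.20, {{"marathon", 0.20}}},
  };

  // Level 1: roles with the lowest share are offered resources first.
  std::sort(roles.begin(), roles.end(),
            [](const Role& a, const Role& b) { return a.share < b.share; });

  for (Role& role : roles) {
    // Level 2: within a role, frameworks with the lowest share come first.
    std::sort(role.frameworks.begin(), role.frameworks.end(),
              [](const Framework& a, const Framework& b) {
                return a.share < b.share;
              });

    for (const Framework& f : role.frameworks) {
      std::cout << "offer next to role=" << role.name
                << " framework=" << f.id << std::endl;
    }
  }

  return 0;
}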






Quota, Reservation, Role, Weight

  • Each framework can have a role, which is used both for permissions and for resource allocation.
  • When accepting offered resources, a framework can reply with Offer::Operation::RESERVE to reserve resources on a particular slave for its role. A reservation is concrete: it says exactly which resources on which machine belong to which role.
  • A quota is the minimum guarantee for a role; it is not tied to a specific node but is guaranteed across the whole cluster.
  • Reserved resources also count towards the quota.
  • Different roles can have different weights.
In the end, the resources are handed out to each framework.







When the allocator is initialized, the batch allocation is set up, at the end of initialization, to run once per allocation interval.

offerCallback is the function registered by the master; remember it, since it is what the allocator invokes to hand resources back to the master as offers.


6. flags.registry == "in_memory" or flags.registry == "replicated_log"
The registry state is kept either in memory or in the replicated log (stored on the local file system and coordinated through ZooKeeper).

7. Elect the leader and detect who the leader is

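// If a module is specified via --master_contender / --master_detector it is
// used; otherwise the factories fall back to the ZooKeeper-based (when --zk
// is given) or standalone implementations.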
Try<MasterContender*> contender_ = MasterContender::create(zk, flags.master_contender);
Try<MasterDetector*> detector_ = MasterDetector::create(zk, flags.master_detector);


8. Create the Master object and start the Master process

Master* master =
    new Master(
      allocator.get(),
      registrar,
      &files,
      contender,
      detector,
      authorizer_,
      slaveRemovalLimiter,
      flags);

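 // spawn() starts the Master actor (a libprocess Process); wait() blocks
 // until the Master process terminates.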
 process::spawn(master);
 process::wait(master->self());
