Best Practices for Resource Management in PrestoDB

Resource management in databases gives administrators control over resources and lets them assign priorities to sessions, ensuring the most critical transactions get a significant share of system resources. In a distributed environment (i.e., a network of autonomous computers), resource management also governs how resources are shared and accessed across the cluster; resource sharing is the foundation of resource management in a distributed system.

PrestoDB is a distributed query engine written by Facebook as the successor to Hive for highly scalable processing of large volumes of data. Built for the Hadoop ecosystem, PrestoDB is designed to scale to tens of thousands of nodes and process petabytes of data. To be usable at production scale, PrestoDB was built to serve thousands of queries from multiple users without bottlenecks or “noisy neighbor” issues. PrestoDB uses resource groups to organize how different workloads are prioritized. This post discusses some of the paradigms that PrestoDB introduces with resource groups, as well as best practices and considerations to think about before setting up a production system with resource groups.

Getting Started

Presto has multiple “resources” for which it can manage quotas. The two main resources are CPU and memory. Additionally, more granular constraints can be specified, such as concurrency, queue size, and CPU time. All of this is done via a pretty unwieldy JSON configuration file, shown in the example below from the PrestoDB doc pages.

{
  "rootGroups": [
    {
      "name": "global",
      "softMemoryLimit": "80%",
      "hardConcurrencyLimit": 100,
      "maxQueued": 1000,
      "schedulingPolicy": "weighted",
      "jmxExport": true,
      "subGroups": [
        {
          "name": "data_definition",
          "softMemoryLimit": "10%",
          "hardConcurrencyLimit": 5,
          "maxQueued": 100,
          "schedulingWeight": 1
        },
        {
          "name": "adhoc",
          "softMemoryLimit": "10%",
          "hardConcurrencyLimit": 50,
          "maxQueued": 1,
          "schedulingWeight": 10,
          "subGroups": [
            {
              "name": "other",
              "softMemoryLimit": "10%",
              "hardConcurrencyLimit": 2,
              "maxQueued": 1,
              "schedulingWeight": 10,
              "schedulingPolicy": "weighted_fair",
              "subGroups": [
                {
                  "name": "${USER}",
                  "softMemoryLimit": "10%",
                  "hardConcurrencyLimit": 1,
                  "maxQueued": 100
                }
              ]
            },
            {
              "name": "bi-${tool_name}",
              "softMemoryLimit": "10%",
              "hardConcurrencyLimit": 10,
              "maxQueued": 100,
              "schedulingWeight": 10,
              "schedulingPolicy": "weighted_fair",
              "subGroups": [
                {
                  "name": "${USER}",
                  "softMemoryLimit": "10%",
                  "hardConcurrencyLimit": 3,
                  "maxQueued": 10
                }
              ]
            }
          ]
        },
        {
          "name": "pipeline",
          "softMemoryLimit": "80%",
          "hardConcurrencyLimit": 45,
          "maxQueued": 100,
          "schedulingWeight": 1,
          "jmxExport": true,
          "subGroups": [
            {
              "name": "pipeline_${USER}",
              "softMemoryLimit": "50%",
              "hardConcurrencyLimit": 5,
              "maxQueued": 100
            }
          ]
        }
      ]
    },
    {
      "name": "admin",
      "softMemoryLimit": "100%",
      "hardConcurrencyLimit": 50,
      "maxQueued": 100,
      "schedulingPolicy": "query_priority",
      "jmxExport": true
    }
  ],
  "selectors": [
    {
      "user": "bob",
      "group": "admin"
    },
    {
      "source": ".*pipeline.*",
      "queryType": "DATA_DEFINITION",
      "group": "global.data_definition"
    },
    {
      "source": ".*pipeline.*",
      "group": "global.pipeline.pipeline_${USER}"
    },
    {
      "source": "jdbc#(?<tool_name>.*)",
      "clientTags": ["hipri"],
      "group": "global.adhoc.bi-${tool_name}.${USER}"
    },
    {
      "group": "global.adhoc.other.${USER}"
    }
  ],
  "cpuQuotaPeriod": "1h"
}
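Before digging into the group parameters, it helps to understand how the `selectors` section routes a query to a group: selectors are evaluated top to bottom, and the first one whose fields all match the query's session wins. The sketch below is illustrative only (it is not Presto's implementation, and it handles just the `user`, `source`, and `queryType` fields from the example above), but it captures the first-match-wins behavior:

```python
import re

# Illustrative sketch (not Presto's code) of selector matching: the first
# selector whose conditions all match the query's session determines the
# resource group. A selector with no conditions acts as a catch-all.
SELECTORS = [
    {"user": "bob", "group": "admin"},
    {"source": ".*pipeline.*", "queryType": "DATA_DEFINITION",
     "group": "global.data_definition"},
    {"source": ".*pipeline.*", "group": "global.pipeline.pipeline_${USER}"},
    {"group": "global.adhoc.other.${USER}"},   # catch-all
]

def pick_group(session):
    for sel in SELECTORS:
        if "user" in sel and not re.fullmatch(sel["user"], session["user"]):
            continue
        if "source" in sel and not re.fullmatch(sel["source"], session.get("source", "")):
            continue
        if "queryType" in sel and sel["queryType"] != session.get("queryType"):
            continue
        # Expand the ${USER} template into the concrete group name.
        return sel["group"].replace("${USER}", session["user"])
    return None

print(pick_group({"user": "bob"}))                              # admin
print(pick_group({"user": "alice", "source": "etl-pipeline"}))  # global.pipeline.pipeline_alice
print(pick_group({"user": "carol"}))                            # global.adhoc.other.carol
```

Note that because matching is first-wins, selector ordering matters: the catch-all must come last, or it would swallow every query.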

Okay, so there is clearly a LOT going on here, so let’s start with the basics and work our way up. The first place to start is understanding the mechanisms Presto uses to enforce query resource limits.

Penalties

Presto doesn’t enforce resource limits at execution time. Rather, Presto introduces the concept of a ‘penalty’ for users who exceed their resource specification. For example, if user ‘bob’ kicked off a huge query that ended up taking vastly more CPU time than allotted, ‘bob’ would incur a penalty that translates into an amount of time his queries are forced to wait in a queued state before they become runnable again. To see this scenario in action, let’s split the cluster resources in half and see what happens when two users each attempt to submit 5 queries at the same time.
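The penalty mechanism can be sketched roughly as follows. This is a simplified model, not Presto's actual bookkeeping: the function names and the exact decay schedule are assumptions, but the shape (overage accumulates as a penalty, and the penalty drains over subsequent quota periods) matches the behavior described in this post:

```python
# Simplified model of penalty bookkeeping (assumed, not Presto's code):
# CPU time used beyond the soft limit accrues as a penalty, and each
# cpuQuotaPeriod the penalty decays by the soft limit until it reaches
# zero, at which point the group's queries become runnable again.

def update_penalty(penalty_s, cpu_used_s, soft_cpu_limit_s):
    """Add any overage beyond the soft CPU limit to the running penalty."""
    overage = max(0, cpu_used_s - soft_cpu_limit_s)
    return penalty_s + overage

def decay_penalty(penalty_s, soft_cpu_limit_s):
    """Each cpuQuotaPeriod, reduce the penalty by the soft limit."""
    return max(0, penalty_s - soft_cpu_limit_s)

def is_runnable(penalty_s):
    return penalty_s == 0

# A query that used 90s of CPU against a 30s soft limit:
p = update_penalty(0, cpu_used_s=90, soft_cpu_limit_s=30)  # penalty = 60s
p = decay_penalty(p, 30)   # after one quota period: 30s remaining
p = decay_penalty(p, 30)   # after two quota periods: 0s, runnable again
assert is_runnable(p)
```

The key takeaway is that enforcement is retroactive: a query is never throttled mid-flight, but the user pays for it afterward in queued time.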

Resource Group Specifications

The example below is a resource specification that distributes CPU resources evenly between two users.

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 5,
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 5,
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1h"
}

The above resource config defines two main resource groups, ‘query1’ and ‘query2’, which serve as buckets for the different queries/users. A few parameters are at work here:

  • hardConcurrencyLimit sets the number of concurrent queries that can be run within the group
  • maxQueued sets the limit on how many queries can be queued
  • schedulingPolicy ‘fair’ determines how queries within the same group are prioritized

Kicking off a single query as each user runs immediately, but subsequent queries stay QUEUED until the first completes, which at least confirms the hardConcurrencyLimit setting. Submitting a sixth query, which exceeds maxQueued, shows that the queue limit is also working as intended.
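The admission behavior observed above can be modeled with a toy function (illustrative only; the names and the synchronous loop are simplifications of what the coordinator actually does): a group runs up to hardConcurrencyLimit queries, queues up to maxQueued more, and rejects the rest.

```python
# Toy admission model for one resource group with the config above:
# hardConcurrencyLimit=1, maxQueued=5.

def admit(running, queued, hard_concurrency=1, max_queued=5):
    """Decide the fate of a newly submitted query."""
    if running < hard_concurrency:
        return "RUNNING"
    if queued < max_queued:
        return "QUEUED"
    return "REJECTED"

# One user submits 7 queries at once: 1 runs, 5 queue, the 7th is rejected.
states = []
running = queued = 0
for _ in range(7):
    s = admit(running, queued)
    states.append(s)
    if s == "RUNNING":
        running += 1
    elif s == "QUEUED":
        queued += 1

print(states)
# ['RUNNING', 'QUEUED', 'QUEUED', 'QUEUED', 'QUEUED', 'QUEUED', 'REJECTED']
```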

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "30s",
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "30s",
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1m"
}

Introducing the soft CPU limit penalizes any query caught using too much CPU time in a given CPU quota period. Here the period is set to 1 minute, and each group is given half of that CPU time. However, testing the above configuration yielded some odd results: once the first query finished, subsequent queries stayed queued for an inordinately long time. The Presto source code shows the reasoning. The softCpuLimit and hardCpuLimit are based on a combination of total cores and the cpuQuotaPeriod. For example, a 10-node cluster of r5.2xlarge instances has 8 vCPU per Presto worker node, for a cluster total of 80 vCPU, which yields 80 vCPU-minutes in the given cpuQuotaPeriod. Therefore, the correct values are shown below.
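The budget arithmetic is worth making explicit: the CPU quota per period scales with the total worker cores, not wall-clock time. Using the cluster numbers from the example above (10 r5.2xlarge workers, 8 vCPU each):

```python
# CPU budget per cpuQuotaPeriod = total worker vCPUs x period length.
workers = 10
vcpus_per_worker = 8          # r5.2xlarge
cpu_quota_period_min = 1      # cpuQuotaPeriod = "1m"

total_vcpu_minutes = workers * vcpus_per_worker * cpu_quota_period_min
soft_cpu_limit = total_vcpu_minutes // 2   # half the cluster per group

print(total_vcpu_minutes, soft_cpu_limit)  # 80 vCPU-minutes total, 40m per group
```

This is why a softCpuLimit of "30s" was far too small: it represented well under one vCPU-minute of a budget that is actually 80 vCPU-minutes per period.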

{
 "rootGroups": [
   {
     "name": "query1",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "40m",
     "schedulingPolicy": "fair",
     "jmxExport": true
   },
   {
     "name": "query2",
     "softMemoryLimit": "50%",
     "hardConcurrencyLimit": 1,
     "maxQueued": 1,
     "softCpuLimit": "40m",
     "schedulingPolicy": "fair",
     "jmxExport": true
   }
 ],
 "selectors": [
   {
     "user": "alice",
     "group": "query1"
   },
   {
     "user": "bob",
     "group": "query2"
   }
 ],
 "cpuQuotaPeriod": "1m"
}

In testing, the above resource group spec results in two queries using a total of 127m of CPU time, after which all further queries are blocked for about 2 minutes before they run again. This blocked time accrues because, for every cpuQuotaPeriod (1 minute), each user is granted 40 minutes back on their penalty. Since the queries in the first minute exceeded the limit by 80+ minutes, it takes roughly 2 cpuQuotaPeriods to drain the penalty enough that queries can be submitted again.
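Reproducing that arithmetic with the simple decay model assumed earlier (127 CPU-minutes used against a 40m soft limit, with 40m granted back per 1-minute period) gives the penalty trajectory below; after two quota periods the penalty is nearly drained, consistent with the roughly two minutes of blocked time observed:

```python
# Penalty trajectory under the assumed decay model:
# initial penalty = CPU used minus the soft limit, then -40m per period.
cpu_used = 127    # total CPU-minutes consumed in the first period
soft_limit = 40   # softCpuLimit per group, in CPU-minutes

penalty = cpu_used - soft_limit
trajectory = [penalty]
while penalty > 0:
    penalty = max(0, penalty - soft_limit)  # one cpuQuotaPeriod elapses
    trajectory.append(penalty)

print(trajectory)  # [87, 47, 7, 0]
```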

Conclusion

The resource group implementation in Presto definitely has some room for improvement. Most obviously, ad hoc users who may not understand the cost of a query before executing it can be heavily penalized, leaving them able to run only very low-cost queries until the penalty drains. However, this solution does minimize the damage a single user can do to a cluster over an extended duration, and usage averages out in the long run. Overall, resource groups are better suited to scheduled workloads that depend on variable input data: a scheduled job won’t arbitrarily end up taking over a large chunk of the cluster’s resources. For resource partitioning between multiple users or teams, the best approach still seems to be running and maintaining multiple segregated Presto clusters.
