The Scheduler Overview
As many concurrent user prompts may arrive at once, the scheduler in nano-VLLM has two primary jobs: 1) decide whether to perform a “prefill” or a “decode” step and 2) decide which sequences to run next. It does this by tracking every sequence that is currently running or waiting to run, and it consults a block manager to decide whether to preempt a sequence based on the number of free KV cache blocks.
Whenever a new request arrives at the LLM engine, its “sequence” object is registered with the scheduler. During every step of the LLM engine, the scheduler is first called to fetch the next sequences to run and to determine the step type (prefill/decode).
Prefill Step: The primary objective of the prefill step is to store the KV cache for all input tokens so that it can be reused during each decode step. Because all of the prompt’s tokens are already available, prefill processes them in parallel, and it requires O(n^2) compute due to the attention computation (where n is the number of tokens in the prompt). The prefill step also produces the first token of the response.
Decode Step: The remaining tokens of the response are generated by the decode step, which produces one token at a time and appends its keys and values to the KV cache. As only one token is processed per step, computing its QK dot products against the cache requires only O(n) compute.
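A rough way to see the compute asymmetry between the two step types (toy Python, counting query–key dot products rather than actual FLOPs):

```python
def prefill_dot_products(n: int) -> int:
    # Prefill: each of the n prompt tokens attends (causally) to all
    # tokens up to and including itself -> n * (n + 1) / 2, i.e. O(n^2).
    return n * (n + 1) // 2

def decode_dot_products(n: int) -> int:
    # Decode: the single new token attends to all n cached tokens -> O(n).
    return n

# For a 1024-token prompt, a single prefill step performs 524,800
# query-key dot products, while each decode step performs only 1,024.
```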
Breaking Down the Scheduler Constructor
def __init__(self, config: Config):
    self.max_num_seqs = config.max_num_seqs
    self.max_num_batched_tokens = config.max_num_batched_tokens
    self.eos = config.eos
    self.block_manager = BlockManager(config.num_kvcache_blocks, config.kvcache_block_size)
    self.waiting: deque[Sequence] = deque()
    self.running: deque[Sequence] = deque()
Within the scheduler constructor, nano-VLLM defines the following variables used by the scheduler:
- max_num_seqs: the maximum number of sequences that can run concurrently within the same batch
- max_num_batched_tokens: the maximum number of tokens that can be processed within the same batch
- eos: the end-of-sequence token; when predicted, generation for that sequence stops (it is removed from the running queue) and the sequence is marked as finished
- block manager: an object used to keep track of the number of available KV cache blocks used for paged attention (more on this in a separate post!)
- waiting: a queue of sequences that have been registered with the scheduler but have not begun generating yet (not yet prefilled)
- running: a queue of sequences that are currently generating predictions (post-prefill)
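To make the flow into these queues concrete, here is a minimal sketch of how a new request lands in the waiting queue. The simplified Sequence stand-in and the add method are assumptions for illustration, not nano-VLLM’s actual classes:

```python
from collections import deque

class Sequence:
    # Simplified stand-in for nano-VLLM's Sequence object.
    def __init__(self, token_ids):
        self.token_ids = token_ids
        self.status = "WAITING"

    def __len__(self):
        return len(self.token_ids)

class Scheduler:
    def __init__(self, max_num_seqs: int):
        self.max_num_seqs = max_num_seqs
        self.waiting: deque = deque()  # registered, not yet prefilled
        self.running: deque = deque()  # prefilled, currently decoding

    def add(self, seq: Sequence):
        # New requests always enter the waiting queue first; they move
        # to running only once a prefill step schedules them.
        self.waiting.append(seq)

sched = Scheduler(max_num_seqs=8)
sched.add(Sequence([101, 2023, 2003]))
# The request now sits in sched.waiting until a prefill step picks it up.
```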
Breaking Down schedule()
def schedule(self) -> tuple[list[Sequence], bool]:
    # prefill
    scheduled_seqs = []
    num_seqs = 0
    num_batched_tokens = 0
    while self.waiting and num_seqs < self.max_num_seqs:
        seq = self.waiting[0]
        if num_batched_tokens + len(seq) > self.max_num_batched_tokens or not self.block_manager.can_allocate(seq):
            break
        num_seqs += 1
        self.block_manager.allocate(seq)
        num_batched_tokens += len(seq) - seq.num_cached_tokens
        seq.status = SequenceStatus.RUNNING
        self.waiting.popleft()
        self.running.append(seq)
        scheduled_seqs.append(seq)
    if scheduled_seqs:
        return scheduled_seqs, True

    # decode
    while self.running and num_seqs < self.max_num_seqs:
        seq = self.running.popleft()
        while not self.block_manager.can_append(seq):
            if self.running:
                self.preempt(self.running.pop())
            else:
                self.preempt(seq)
                break
        else:
            num_seqs += 1
            self.block_manager.may_append(seq)
            scheduled_seqs.append(seq)
    assert scheduled_seqs
    self.running.extendleft(reversed(scheduled_seqs))
    return scheduled_seqs, False
During every step of the LLM engine, the schedule function is first called to determine whether to perform a “prefill” step or a “decode” step and which sequences to prefill or decode. Currently, a step (covering all sequences in a batch) can be either prefill or decode, but not both.
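In the engine’s step loop, the returned boolean selects which forward pass runs for the whole batch. A sketch, where run_prefill and run_decode are hypothetical helpers standing in for the engine’s actual model calls:

```python
def engine_step(scheduler, run_prefill, run_decode):
    # One engine step: ask the scheduler which sequences to run and
    # whether this is a prefill or decode step, then dispatch the
    # entire batch to exactly one kind of pass (never both).
    seqs, is_prefill = scheduler.schedule()
    if is_prefill:
        return run_prefill(seqs)  # build KV caches + emit first tokens
    return run_decode(seqs)       # emit one new token per sequence
```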
scheduled_seqs = []
num_seqs = 0
num_batched_tokens = 0
The nano-VLLM scheduler first initializes scheduled_seqs to store the sequences to prefill or decode in the current step. It also initializes num_seqs and num_batched_tokens to track the number of sequences and the number of tokens already scheduled in the current step.
Breaking Down the Prefill Block
while self.waiting and num_seqs < self.max_num_seqs:
    seq = self.waiting[0]
    if num_batched_tokens + len(seq) > self.max_num_batched_tokens or not self.block_manager.can_allocate(seq):
        break
    num_seqs += 1
    self.block_manager.allocate(seq)
    num_batched_tokens += len(seq) - seq.num_cached_tokens
    seq.status = SequenceStatus.RUNNING
    self.waiting.popleft()
    self.running.append(seq)
    scheduled_seqs.append(seq)
if scheduled_seqs:
    return scheduled_seqs, True
Within the prefill block, nano-VLLM checks two conditions before each iteration: 1) whether any sequences remain in the “waiting” queue, and 2) whether the number of scheduled sequences is still below the maximum.
The scheduler then fetches the first sequence from the waiting queue and checks two additional conditions: 1) whether adding this sequence would exceed the current batch’s token budget, and 2) whether the block manager has enough free KV cache blocks to allocate it.
If both of these conditions are satisfied, the scheduler then schedules the sequence by performing the following:
- update the count of sequences scheduled in the current batch
- register the sequence with the block manager so that the block manager can update the number of free KV cache blocks
- update the number of tokens within the current batch
- set the status of the sequence to running
- remove the sequence from the “waiting” list
- add the sequence to the “running” list
At the end of the prefill block, there is a check to see if any sequences have been scheduled for prefill. If there is at least one sequence, then the sequences are returned and the prefill flag is set to True.
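The admission logic above reduces to a small toy function over prompt lengths (ignoring the block manager and cached tokens), which makes the first-come-first-served token-budget behavior easy to see:

```python
from collections import deque

def admit_for_prefill(waiting: deque, max_num_seqs: int, max_num_batched_tokens: int):
    # Greedily admit sequences from the front of the waiting queue until
    # either the sequence cap or the batch token budget would be exceeded.
    scheduled, num_tokens = [], 0
    while waiting and len(scheduled) < max_num_seqs:
        prompt_len = waiting[0]
        if num_tokens + prompt_len > max_num_batched_tokens:
            break  # FCFS: stop here rather than skip ahead to a shorter prompt
        num_tokens += prompt_len
        scheduled.append(waiting.popleft())
    return scheduled

waiting = deque([100, 200, 300, 400])  # prompt lengths of queued requests
batch = admit_for_prefill(waiting, max_num_seqs=8, max_num_batched_tokens=512)
# batch == [100, 200]: admitting the 300-token prompt would exceed the
# 512-token budget, so it (and everything behind it) waits for a later step.
```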
Breaking Down the Decode Block
# decode
while self.running and num_seqs < self.max_num_seqs:
    seq = self.running.popleft()
    while not self.block_manager.can_append(seq):
        if self.running:
            self.preempt(self.running.pop())
        else:
            self.preempt(seq)
            break
    else:
        num_seqs += 1
        self.block_manager.may_append(seq)
        scheduled_seqs.append(seq)
assert scheduled_seqs
self.running.extendleft(reversed(scheduled_seqs))
return scheduled_seqs, False
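The decode block leans on Python’s while/else: the else body runs only when the inner loop exits without hitting break, i.e. once can_append succeeds. The toy model below imitates that control flow with a stub block manager; sequences are just integers counting the blocks they hold, and (unlike nano-VLLM, which returns a preempted sequence to the waiting queue) a fully preempted sequence is simply dropped here:

```python
from collections import deque

class ToyBlockManager:
    # Stub: a single pool of free blocks; appending one decoded
    # token costs at most one block in this toy model.
    def __init__(self, free_blocks: int):
        self.free_blocks = free_blocks

    def can_append(self) -> bool:
        return self.free_blocks >= 1

    def may_append(self):
        self.free_blocks -= 1

def schedule_decode(running: deque, bm: ToyBlockManager, max_num_seqs: int):
    scheduled = []
    while running and len(scheduled) < max_num_seqs:
        seq = running.popleft()
        while not bm.can_append():
            if running:
                # Preempt the most recently admitted sequence (back of
                # the queue) and reclaim the blocks it was holding.
                bm.free_blocks += running.pop()
            else:
                # Nothing left to evict: preempt seq itself and give up.
                bm.free_blocks += seq
                break
        else:
            # Inner loop exited without break: a free block is guaranteed.
            bm.may_append()
            scheduled.append(seq)
    return scheduled

# With no free blocks, the scheduler evicts the tail sequence (2 blocks)
# so the sequences at the head of the queue can keep decoding:
# schedule_decode(deque([4, 3, 2]), ToyBlockManager(0), 8) == [4, 3]
```

This mirrors the real decode block’s ordering guarantee as well: scheduled sequences are pushed back onto the front of the running queue (via extendleft of the reversed list) so their relative order is preserved for the next step.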