(ravllm) root@local:~
INFO 11-01 01:33:42 importing.py:13] Triton not installed; certain GPU-related functions will not be available.
INFO 11-01 01:33:45 api_server.py:528] vLLM API server version 0.6.3.post2.dev76+g51c24c97
INFO 11-01 01:33:45 api_server.py:529] args: Namespace(subparser='serve', model_tag='/root/modelspace/Qwen2.5-1.5B-Instruct', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=True, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/root/modelspace/Qwen2.5-1.5B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='cpu', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen/Qwen2.5-1.5B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7602e53ec700>)
WARNING 11-01 01:33:52 arg_utils.py:1038] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-01 01:33:52 config.py:421] Async output processing is only supported for CUDA, TPU, XPU. Disabling it for other platforms.
INFO 11-01 01:33:52 llm_engine.py:240] Initializing an LLM engine (v0.6.3.post2.dev76+g51c24c97) with config: model='/root/modelspace/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='/root/modelspace/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
WARNING 11-01 01:33:53 cpu_executor.py:332] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 11-01 01:33:53 cpu_executor.py:362] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 11-01 01:33:56 importing.py:13] Triton not installed; certain GPU-related functions will not be available.
(VllmWorkerProcess pid=124634) INFO 11-01 01:33:59 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on CPU.
(VllmWorkerProcess pid=124634) INFO 11-01 01:33:59 selector.py:131] Using Torch SDPA backend.
(VllmWorkerProcess pid=124634) INFO 11-01 01:33:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=124634) INFO 11-01 01:33:59 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on CPU.
(VllmWorkerProcess pid=124634) INFO 11-01 01:33:59 selector.py:131] Using Torch SDPA backend.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.43it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.43it/s]
(VllmWorkerProcess pid=124634) INFO 11-01 01:34:00 cpu_executor.py:214]
WARNING 11-01 01:34:01 serving_embedding.py:200] embedding_mode is False. Embedding API will not work.
INFO 11-01 01:34:01 launcher.py:19] Available routes are:
INFO 11-01 01:34:01 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 11-01 01:34:01 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 11-01 01:34:01 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 11-01 01:34:01 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 11-01 01:34:01 launcher.py:27] Route: /health, Methods: GET
INFO 11-01 01:34:01 launcher.py:27] Route: /tokenize, Methods: POST
INFO 11-01 01:34:01 launcher.py:27] Route: /detokenize, Methods: POST
INFO 11-01 01:34:01 launcher.py:27] Route: /v1/models, Methods: GET
INFO 11-01 01:34:01 launcher.py:27] Route: /version, Methods: GET
INFO 11-01 01:34:01 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 11-01 01:34:01 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 11-01 01:34:01 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO: Started server process [124595]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on socket ('0.0.0.0', 8000) (Press CTRL+C to quit)
INFO 11-01 01:34:11 metrics.py:363] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 11-01 01:34:21 metrics.py:363] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
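The exact shell command is not shown in the transcript above, but the logged args: Namespace(...) pins down the configuration: the local checkpoint at /root/modelspace/Qwen2.5-1.5B-Instruct is served on the CPU backend at 0.0.0.0:8000 under the name Qwen/Qwen2.5-1.5B-Instruct, with frontend multiprocessing disabled. A launch command consistent with those args would look roughly like the sketch below; the exact invocation is an assumption, and the VLLM_CPU_KVCACHE_SPACE export is optional (it only silences the "using 4 by default" warning by setting the CPU KV cache size explicitly).

# Assumed invocation, reconstructed from the logged args; not the literal command used.
export VLLM_CPU_KVCACHE_SPACE=4   # CPU KV cache size in GB; the log warns when this is unset
vllm serve /root/modelspace/Qwen2.5-1.5B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --device cpu \
    --served-model-name Qwen/Qwen2.5-1.5B-Instruct \
    --disable-frontend-multiprocessing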
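Once the Uvicorn running on socket ('0.0.0.0', 8000) line appears, the OpenAI-compatible routes listed above (/v1/models, /v1/completions, /v1/chat/completions) are live. A minimal smoke test against the chat endpoint, assuming the server is reachable from the same host and using the served model name from the log, could look like this:

# Assumed smoke test against the running server; prompt and max_tokens are illustrative.
curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen2.5-1.5B-Instruct",
          "messages": [{"role": "user", "content": "Say hello."}],
          "max_tokens": 64
        }'

If the request succeeds, the periodic metrics.py lines should report non-zero prompt and generation throughput instead of the idle 0.0 tokens/s shown above.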