# NVIDIA NeMo Inference Microservice Deploy Job

> Deploy a W&B model artifact to the NVIDIA NeMo Inference Microservice with W&B Launch for scalable model serving.

W&B Launch converts the model artifact to an NVIDIA NeMo Model and then deploys it to a running NIM/Triton server.

W&B Launch currently supports the following compatible model types:

1. [Llama2](https://llama.meta.com/llama2/)
2. [StarCoder](https://github.com/bigcode-project/starcoder)
3. NV-GPT (coming soon)

Deployment time varies by model and machine type. The base Llama2-7b configuration takes about 1 minute on Google Cloud's `a2-ultragpu-1g`.

## Quickstart

1. [Create a Launch queue](/ko/platform/launch/add-job-to-queue/) if you don't have one already. An example queue configuration is shown below.

   ```yaml
   net: host
   gpus: all # a specific set of GPUs, or `all` to use every GPU
   runtime: nvidia # the nvidia container runtime is also required
   volume:
     - model-store:/model-store/
   ```

2. Create this job in your project:

   ```bash
   wandb job create -n "deploy-to-nvidia-nemo-inference-microservice" \
      -e $ENTITY \
      -p $PROJECT \
      -E jobs/deploy_to_nvidia_nemo_inference_microservice/job.py \
      -g andrew/nim-updates \
      git https://github.com/wandb/launch-jobs
   ```

3. Start an agent on your GPU machine:

   ```bash
   wandb launch-agent -e $ENTITY -p $PROJECT -q $QUEUE
   ```

4. Submit the deployment launch job with your desired configuration from the [Launch UI](https://wandb.ai/launch).

   1. You can also submit it through the CLI:

      ```bash
      wandb launch -d gcr.io/playground-111/deploy-to-nemo:latest \
        -e $ENTITY \
        -p $PROJECT \
        -q $QUEUE \
        -c $CONFIG_JSON_FNAME
      ```

5. Track the deployment progress in the Launch UI.

6. Once the deployment completes, you can immediately send a `curl` request to the endpoint to test the model. The model name is always `ensemble`.

   ```bash
   #!/bin/bash
   curl -X POST "http://0.0.0.0:9999/v1/completions" \
       -H "accept: application/json" \
       -H "Content-Type: application/json" \
       -d '{
           "model": "ensemble",
           "prompt": "Tell me a joke",
           "max_tokens": 256,
           "temperature": 0.5,
           "n": 1,
           "stream": false,
           "stop": "string",
           "frequency_penalty": 0.0
           }'
   ```
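
If you would rather test the endpoint from Python than from `curl`, the same request can be sketched with the standard library alone. This is a minimal sketch, not part of the deploy job: the helper names `build_payload`, `build_request`, and `complete` are hypothetical, the URL and request fields are copied from the `curl` example, and the response is returned as parsed JSON without assuming anything about its schema.

```python
import json
import urllib.request

# Assumption: same address as the curl example; change host/port for your deployment.
ENDPOINT = "http://0.0.0.0:9999/v1/completions"


def build_payload(prompt, max_tokens=256, temperature=0.5):
    """Build the request body from the curl example. The model name is always "ensemble"."""
    return {
        "model": "ensemble",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "n": 1,
        "stream": False,
        "stop": "string",
        "frequency_penalty": 0.0,
    }


def build_request(prompt, endpoint=ENDPOINT):
    """Wrap the JSON payload in a POST request with the same headers as the curl call."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, endpoint=ENDPOINT, timeout=60):
    """Send the request and return the decoded JSON response as a Python object."""
    with urllib.request.urlopen(build_request(prompt, endpoint), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `complete("Tell me a joke")` should return the same JSON document that the `curl` command prints. Keeping payload construction separate from the network call makes it possible to inspect or log the request without contacting the server.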