[GH-ISSUE #351] How can I determine if Humanify is using CUDA or CPU locally? #71

Open
opened 2026-03-03 13:52:45 +03:00 by kerem · 6 comments

Originally created by @GermanKousal on GitHub (Mar 2, 2025).
Original GitHub issue: https://github.com/jehna/humanify/issues/351

Hello everyone,

I'm currently using humanify to process a 4MB obfuscated file. After several hours, the progress still shows 0%, which leads me to suspect that the tool might be running on CPU instead of utilizing CUDA acceleration.

Is there a way to verify whether humanify is using CUDA or CPU on my system? Additionally, is it possible to force the tool to use CUDA, and if CUDA is not available, have it fail immediately?

Any help or suggestions would be greatly appreciated. Thank you!

Forgot to say I'm using HumanifyJS through WSL.


@KyleSau commented on GitHub (Mar 12, 2025):

This as well


@0xdevalias commented on GitHub (Mar 24, 2025):

> After several hours, the progress still shows 0%, which leads me to suspect that the tool might be running on CPU instead of utilizing CUDA acceleration.

@GermanKousal / @KyleSau Do you have any logs from running it? Looking at some older issues (e.g. https://github.com/jehna/humanify/issues/50) it seems to print details about the GPU being used.


> Is there a way to verify whether humanify is using CUDA or CPU on my system? Additionally, is it possible to force the tool to use CUDA, and if CUDA is not available, have it fail immediately?

Following along a similar path of debugging that I did in the past (https://github.com/jehna/humanify/issues/53#issuecomment-2306107630):

```shell
⇒ npm start -- -h

> humanifyjs@2.2.2 start
> tsx src/index.ts -h

..snip..

Usage: humanify [options] [command]

..snip..

Commands:
  local [options] <input>   Use a local LLM to unminify code

..snip..
```
```shell
⇒ npm start -- local -h

> humanifyjs@2.2.2 start
> tsx src/index.ts local -h

..snip..

Use a local LLM to unminify code

Arguments:
  input                        The input minified Javascript file

Options:
  -m, --model <model>          The model to use (default: "2b")
  -o, --outputDir <output>     The output directory (default: "output")
  -s, --seed <seed>            Seed for the model to get reproduceable results (leave out for random seed)
  --disableGpu                 Disable GPU acceleration
  --verbose                    Show verbose output
  --contextSize <contextSize>  The context size to use for the LLM (default: "1000")
  -h, --help                   display help for command
```

The main part of the CLI is set up in `src/index.ts`, and then loads the `local` command from `src/commands/local.ts`:

https://github.com/jehna/humanify/blob/ad3a03301e504bd1f2566cbb95c868a420d9cdc9/src/index.ts#L4-L13

Among other things, the `local` command has `--disableGpu` / `--contextSize` args:

https://github.com/jehna/humanify/blob/ad3a03301e504bd1f2566cbb95c868a420d9cdc9/src/commands/local.ts#L22-L28

Which are passed in to `llama` (`src/plugins/local-llm-rename/llama.ts`), and then `unminify` (`src/unminify.ts`):

https://github.com/jehna/humanify/blob/ad3a03301e504bd1f2566cbb95c868a420d9cdc9/src/commands/local.ts#L2

https://github.com/jehna/humanify/blob/ad3a03301e504bd1f2566cbb95c868a420d9cdc9/src/commands/local.ts#L4

https://github.com/jehna/humanify/blob/ad3a03301e504bd1f2566cbb95c868a420d9cdc9/src/commands/local.ts#L37-L47

The `llama` function uses `node-llama-cpp`'s [`getLlama`](https://node-llama-cpp.withcat.ai/api/functions/getLlama), and then calls [`loadModel`](https://node-llama-cpp.withcat.ai/api/classes/Llama#loadmodel), [`createContext`](https://node-llama-cpp.withcat.ai/api/classes/LlamaModel#createcontext), etc (passing in `disableGpu`, etc):

https://github.com/jehna/humanify/blob/ad3a03301e504bd1f2566cbb95c868a420d9cdc9/src/plugins/local-llm-rename/llama.ts#L1-L6

https://github.com/jehna/humanify/blob/ad3a03301e504bd1f2566cbb95c868a420d9cdc9/src/plugins/local-llm-rename/llama.ts#L19-L33

Skimming through the `node-llama-cpp` docs for those sorts of functions and similar:

- https://github.com/withcatai/node-llama-cpp
  - > Run AI models locally on your machine with node.js bindings for llama.cpp. Enforce a JSON schema on the model output on the generation level
- https://node-llama-cpp.withcat.ai/
- https://node-llama-cpp.withcat.ai/guide/building-from-source#building-inside-your-app
  - > **Building Inside Your App**
    >
    > The best way to use a customized build is by customizing the options passed to the [`getLlama`](https://node-llama-cpp.withcat.ai/api/functions/getLlama).
- https://node-llama-cpp.withcat.ai/guide/building-from-source#customize-build
  - > **Customizing the Build**
    >
    > - **Metal:** To configure Metal support see the [Metal support guide](https://node-llama-cpp.withcat.ai/guide/Metal).
    > - **CUDA:** To configure CUDA support see the [CUDA support guide](https://node-llama-cpp.withcat.ai/guide/CUDA).
    > - **Vulkan:** To configure Vulkan support see the [Vulkan support guide](https://node-llama-cpp.withcat.ai/guide/Vulkan).
    >
    > `llama.cpp` has CMake build options that can be configured to customize the build.
- https://node-llama-cpp.withcat.ai/api/functions/getLlama
  - > Get a `llama.cpp` binding.
  - > Defaults to use a local binary built using the source download or source build CLI commands if one exists, otherwise, uses a prebuilt binary, and fallbacks to building from source if a prebuilt binary is not found.
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaOptions
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaOptions#gpu
  - > The compute layer implementation type to use for llama.cpp.
    >
    > - **`"auto"`**: Automatically detect and use the best GPU available (Metal on macOS, and CUDA or Vulkan on Windows and Linux)
    > - **`"metal"`**: Use Metal. Only supported on macOS. Enabled by default on Apple Silicon Macs.
    > - **`"cuda"`**: Use CUDA.
    > - **`"vulkan"`**: Use Vulkan.
    > - **`false`**: Disable any GPU support and only use the CPU.
    >
    > `"auto"` by default.
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaOptions#build
  - > Set what build method to use.
    >
    > - **`"auto"`**: If a local build is found, use it. Otherwise, if a prebuilt binary is found, use it. Otherwise, build from source.
    > - **`"never"`**: If a local build is found, use it. Otherwise, if a prebuilt binary is found, use it. Otherwise, throw a `NoBinaryFoundError` error.
    > - **`"forceRebuild"`**: Always build from source. Be cautious with this option, as it will cause the build to fail on Windows when the binaries are in use by another process.
    > - **`"try"`**: If a local build is found, use it. Otherwise, try to build from source and use the resulting binary. If building from source fails, use a prebuilt binary if found.
    >
    > When running from inside an Asar archive in Electron, building from source is not possible, so it'll never build from source. To allow building from source in Electron apps, make sure you ship `node-llama-cpp` as an unpacked module.
    >
    > Defaults to `"auto"`. On Electron, defaults to `"never"`.
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaOptions#cmakeoptions
  - > Set custom CMake options for `llama.cpp`
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaOptions#existingprebuiltbinarymustmatchbuildoptions
  - > When a prebuilt binary is found, only use it if it was built with the same build options as the ones specified in `buildOptions`. Disabled by default.
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaOptions#useprebuiltbinaries
  - > Use prebuilt binaries if they match the build options. Enabled by default.
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaOptions#debug
- https://node-llama-cpp.withcat.ai/api/classes/Llama
- https://node-llama-cpp.withcat.ai/api/classes/Llama#gpu
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaGpuType
  - `type LlamaGpuType = "metal" | "cuda" | "vulkan" | false;`
- https://node-llama-cpp.withcat.ai/api/classes/Llama#supportsgpuoffloading
- https://node-llama-cpp.withcat.ai/api/classes/Llama#supportsmmap
- https://node-llama-cpp.withcat.ai/api/classes/Llama#gpusupportsmmap
- https://node-llama-cpp.withcat.ai/api/classes/Llama#cpumathcores
- https://node-llama-cpp.withcat.ai/api/classes/Llama#maxthreads
  - > The maximum number of threads that can be used by the Llama instance.
    >
    > If set to 0, the Llama instance will have no limit on the number of threads.
    >
    > See the `maxThreads` option of `getLlama` for more information.
- https://node-llama-cpp.withcat.ai/api/classes/Llama#llamacpprelease
- https://node-llama-cpp.withcat.ai/api/classes/Llama#systeminfo
- https://node-llama-cpp.withcat.ai/api/classes/Llama#getvramstate
  - > The total amount of VRAM that is currently being used.
    >
    > `unifiedSize` represents the amount of VRAM that is shared between the CPU and GPU. On SoC devices, this is usually the same as `total`.
- https://node-llama-cpp.withcat.ai/api/classes/Llama#getswapstate
  - > Get the state of the swap memory.
    >
    > - `maxSize` - The maximum size of the swap memory that the system can allocate. If the swap size is dynamic (like on macOS), this will be Infinity.
    > - `allocated` - The total size allocated by the system for swap memory.
    > - `used` - The amount of swap memory that is currently being used from the allocated size.
    >
    > On Windows, this will return the info for the page file.
- https://node-llama-cpp.withcat.ai/api/classes/Llama#getgpudevicenames
- https://node-llama-cpp.withcat.ai/api/classes/Llama#loadmodel
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaModelOptions
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaModelOptions#gpulayers
  - > Number of layers to store in VRAM.
    >
    > - **`"auto"`** - adapt to the current VRAM state and try to fit as many layers as possible in it. Takes into account the VRAM required to create a context with a `contextSize` set to `"auto"`.
    > - **`"max"`** - store all layers in VRAM. If there's not enough VRAM, an error will be thrown. Use with caution.
    > - **`number`** - store the specified number of layers in VRAM. If there's not enough VRAM, an error will be thrown. Use with caution.
    > - **`{min?: number, max?: number, fitContext?: {contextSize: number}}`** - adapt to the current VRAM state and try to fit as many layers as possible in it, but at least `min` and at most `max` layers. Set `fitContext` to the parameters of a context you intend to create with the model, so it'll take it into account in the calculations and leave enough memory for such a context.
    >
    > If GPU support is disabled, will be set to `0` automatically.
    >
    > Defaults to `"auto"`.
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaModelOptions#onloadprogress
  - > Called with the load percentage when the model is being loaded.
- https://node-llama-cpp.withcat.ai/api/classes/LlamaModel
- https://node-llama-cpp.withcat.ai/api/classes/LlamaModel#gpulayers
  - > Number of layers offloaded to the GPU. If GPU support is disabled, this will always be `0`.
- https://node-llama-cpp.withcat.ai/api/classes/LlamaModel#size
  - > Total model size in memory in bytes.
    >
    > When using mmap, actual memory usage may be higher than this value due to `llama.cpp`'s performance optimizations.
- https://node-llama-cpp.withcat.ai/api/classes/LlamaModel#createcontext
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaContextOptions
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaContextOptions#threads
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaContextOptions#failedcreationremedy
  - > On failed context creation, retry the creation with a smaller context size.
    >
    > Only works if `contextSize` is set to `"auto"`, left as default or set to an object with `min` and/or `max` properties.
    >
    > Set `retries` to `false` to disable.
- https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaContextOptions#performancetracking
  - > Track the inference performance of the context, so using `.printTimings()` will work.
    >
    > Defaults to `false`.
- https://node-llama-cpp.withcat.ai/api/classes/LlamaContext
- https://node-llama-cpp.withcat.ai/api/classes/LlamaContext#currentthreads
  - > The number of threads currently used to evaluate tokens
- https://node-llama-cpp.withcat.ai/api/classes/LlamaContext#idealthreads
  - > The number of threads that are preferred to be used to evaluate tokens.
    >
    > The actual number of threads used may be lower when other evaluations are running in parallel.
- https://node-llama-cpp.withcat.ai/api/classes/LlamaContext#printtimings
  - > Print the timings of token evaluation since that last print for this context.
    >
    > Requires the `performanceTracking` option to be enabled.
    >
    > **Note:** it prints on the `LlamaLogLevel.info` level, so if you set the level of your `Llama` instance higher than that, it won't print anything.
- etc

Based on that, it seems that `node-llama-cpp` gives us a whole bunch of options for controlling if/what GPU aspects are used, and inspecting that later on.
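
To make that concrete, here's a minimal standalone detection sketch using only the APIs linked above (the `check-gpu.ts` file name and the idea of running it outside of `humanify` are my own framing, not something the project ships):

```typescript
// check-gpu.ts — standalone sketch using the node-llama-cpp APIs linked above.
// Run with e.g. `npx tsx check-gpu.ts` in a project that has node-llama-cpp installed.
import { getLlama } from "node-llama-cpp";

const llama = await getLlama(); // defaults to gpu: "auto"

console.log("GPU type:", llama.gpu); // "metal" | "cuda" | "vulkan" | false
console.log("Supports GPU offloading:", llama.supportsGpuOffloading);
console.log("GPU devices:", await llama.getGpuDeviceNames());
console.log("VRAM state:", await llama.getVramState());

if (llama.gpu === false) {
  console.log("No GPU backend loaded; inference would run on CPU only.");
}
```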

Looking back at `humanify`'s implementation with this new knowledge, we can see that when `--disableGpu` is passed (or if `IS_CI`), then [`getLlama`](https://node-llama-cpp.withcat.ai/api/functions/getLlama)'s [`gpu`](https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaOptions#gpu) option is set to `false`; otherwise it's left as `"auto"`:

https://github.com/jehna/humanify/blob/ad3a03301e504bd1f2566cbb95c868a420d9cdc9/src/plugins/local-llm-rename/llama.ts#L24-L25

I would need to run `humanify local` and look more specifically at what it currently logs, plus what it logs in `--verbose` / any debug modes, to know whether it's possible to see when it's running on GPU vs CPU in its current form; and I suspect, based on the above, that code changes would be required to implement a 'force GPU' mode (a rough sketch of which follows below).
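
For illustration, here's a hedged sketch of what such a 'force GPU' tweak to the `getLlama` call in `src/plugins/local-llm-rename/llama.ts` might look like. The `forceGpu` option is invented for illustration (it is not an existing humanify flag); per the `gpu` option docs above, requesting a specific backend should fail rather than silently fall back if it can't be used:

```typescript
// Sketch only — not humanify's actual implementation. `opts.forceGpu` is a
// hypothetical new flag alongside the existing `opts.disableGpu`.
const llama = await getLlama({
  gpu: opts.disableGpu
    ? false // CPU only (what --disableGpu does today)
    : opts.forceGpu
      ? "cuda" // require CUDA; should error out if CUDA can't be loaded
      : "auto" // current default: best available GPU, with CPU fallback
});
console.log("GPU type:", llama.gpu); // verify what was actually selected
```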

When I get a chance, I'll try running things locally and debugging further as to what is available, and think more about which of the above features would make sense to combine to improve the current state of humanify; though obviously if you get a chance to do similar before me, I would love to hear any insights/experiments/etc you discover.


This older issue sounds like it might be a little bit similar to the root issue you seem to be observing, so it might be useful knowledge as well:

> it has always been at 0%.
>
> _Originally posted by @Zeng-aN in https://github.com/jehna/humanify/issues/207#issuecomment-2466121481_

> What's your gpu? To me it seems that it just takes a lot of time to process if there's no other errors.
>
> Have you tried `--disableGpu`?
>
> _Originally posted by @jehna in https://github.com/jehna/humanify/issues/207#issuecomment-2466168014_

> Thank you for your reply, my GPU is AMD Radeon Pro 5300M. I will try the suggestions you've given.
>
> _Originally posted by @Zeng-aN in https://github.com/jehna/humanify/issues/207#issuecomment-2466539350_

> There has been progress, thank you for your solution, but the progress bar is still too slow
>
> _Originally posted by @Zeng-aN in https://github.com/jehna/humanify/issues/207#issuecomment-2466544672_

> > AMD Radeon Pro 5300M
>
> If I googled correctly, there's 4gb of memory, which is probably too low for the model. I think you'd need much beefier gpu to run Humanify locally
>
> _Originally posted by @jehna in https://github.com/jehna/humanify/issues/207#issuecomment-2466627574_


@0xdevalias commented on GitHub (Mar 24, 2025):

Another thing worth noting, which may or may not be relevant here, relates to the `node-llama-cpp` build options and whether it was built for CUDA/etc. I'm thinking of some of the other issues we've seen related to Windows, where things seem to work when run through WSL but not when run natively, or vice versa. I wonder if `node-llama-cpp` is getting built for e.g. Windows WSL, but then doesn't have the right binaries available for native Windows, or similar (I don't really use Windows these days, but wanted to capture the thought while I was having it).

I don't remember where I saw this (assuming I'm remembering correctly), but I think I saw something related to builds somewhere in `humanify`'s source, maybe in a `postinstall` script, or within one of the commands or similar. I'd need to dig deeper to figure out if I'm remembering correctly and, if so, where it was.
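
One way to check which binary variant actually got loaded at runtime (complementing the GPU detection sketch above) is via the `Llama` instance properties that also show up in the debug output later in this thread; a small sketch, assuming `node-llama-cpp` is importable from the current package:

```typescript
// Sketch: inspect the provenance of the llama.cpp binding node-llama-cpp loaded.
// These are the same properties dumped by the debug patch further down this thread.
import { getLlama } from "node-llama-cpp";

const llama = await getLlama();
console.log("Build type:", llama.buildType);              // e.g. 'prebuilt'
console.log("llama.cpp release:", llama.llamaCppRelease); // { repo, release }
console.log("System info:", llama.systemInfo);            // compile-time feature flags
```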

**Edit:** I think this was my previous deep dive into this, and it looks as though it may happen as a `postinstall` on `node-llama-cpp` itself:

- https://github.com/jehna/humanify/issues/135#issuecomment-2381851607


@0xdevalias commented on GitHub (Mar 24, 2025):

There are some notes in here that might also be relevant:

- https://node-llama-cpp.withcat.ai/guide/CUDA
  - > `node-llama-cpp` ships with pre-built binaries with CUDA support for Windows and Linux, and these are automatically used when CUDA is detected on your machine.
    >
    > To use `node-llama-cpp`'s CUDA support with your NVIDIA GPU, make sure you have [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) 12.2 or higher installed on your machine.
    >
    > If the pre-built binaries don't work with your CUDA installation, `node-llama-cpp` will automatically download a release of `llama.cpp` and build it from source with CUDA support. Building from source with CUDA support is slow and can take up to an hour.
    >
    > The pre-built binaries are compiled with CUDA Toolkit 12.2, so any version of CUDA Toolkit that is 12.2 or higher should work with the pre-built binaries. If you have an older version of CUDA Toolkit installed on your machine, consider updating it to avoid having to wait the long build time.

Random pondering: I wonder what sort of output `humanify` / `node-llama-cpp` would show if it was having to compile a new version; and whether that would look as though it was hanging on 0% during that process.

This also looks like a good standalone test to help get some more info:

> ## [Testing CUDA Support](https://node-llama-cpp.withcat.ai/guide/CUDA#testing-cuda-support)
>
> To check whether the CUDA support works on your machine, run this command:
>
> ```shell
> npx --no node-llama-cpp inspect gpu
> ```
>
> You should see an output like this:
>
> ```shell
> CUDA: available
>
> CUDA device: NVIDIA RTX A6000
> CUDA used VRAM: 0.54% (266.88MB/47.65GB)
> CUDA free VRAM: 99.45% (47.39GB/47.65GB)
>
> CPU model: Intel(R) Xeon(R) Gold 5315Y CPU @ 3.20GHz
> Used RAM: 2.51% (1.11GB/44.08GB)
> Free RAM: 97.48% (42.97GB/44.08GB)
> ```
>
> If you see `CUDA used VRAM` in the output, it means that CUDA support is working on your machine.

> ## [Testing Vulkan Support](https://node-llama-cpp.withcat.ai/guide/Vulkan#testing-vulkan-support)
>
> To check whether the Vulkan support works on your machine, run this command:
>
> ```shell
> npx --no node-llama-cpp inspect gpu
> ```
>
> You should see an output like this:
>
> ```shell
> Vulkan: available
>
> Vulkan device: NVIDIA RTX A6000
> Vulkan used VRAM: 0% (0B/47.99GB)
> Vulkan free VRAM: 100% (47.99GB/47.99GB)
>
> CPU model: Intel(R) Xeon(R) Gold 5315Y CPU @ 3.20GHz
> Used RAM: 2.51% (1.11GB/44.08GB)
> Free RAM: 97.48% (42.97GB/44.08GB)
> ```
>
> If you see `Vulkan used VRAM` in the output, it means that Vulkan support is working on your machine.

There are also some further notes a little later on about how to configure the GPU type from code (it recommends leaving it as the default `"auto"`), including a check with `console.log("GPU type:", llama.gpu)`:

- https://node-llama-cpp.withcat.ai/guide/CUDA#using-node-llama-cpp-with-cuda
  - > ## Using `node-llama-cpp` With CUDA
    >
    > It's recommended to use [`getLlama`](https://node-llama-cpp.withcat.ai/api/functions/getLlama) without specifying a GPU type, so it'll detect the available GPU types and use the best one automatically.
    >
    > To do this, just use `getLlama` without any parameters:
    >
    > ```typescript
    > const llama = await getLlama();
    > console.log("GPU type:", llama.gpu);
    > ```
    >
    > To force it to use CUDA, you can use the [`gpu`](https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaOptions#gpu) option:
    >
    > ```typescript
    > const llama = await getLlama({
    >     gpu: "cuda"
    > });
    > console.log("GPU type:", llama.gpu);
    > ```
    >
    > By default, `node-llama-cpp` will offload as many layers of the model to the GPU as it can fit in the VRAM.
    >
    > To force it to offload a specific number of layers, you can use the [`gpuLayers`](https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaModelOptions#gpulayers) option:
    >
    > ```typescript
    > const model = await llama.loadModel({
    >     modelPath,
    >     gpuLayers: 33 // or any other number of layers you want
    > });
    > ```

@0xdevalias commented on GitHub (Mar 24, 2025):

These commands might also be useful in debugging if/how well certain models might run on your system:

- https://node-llama-cpp.withcat.ai/cli/inspect/estimate
  - > `inspect estimate` command
    > Estimate the compatibility of a model with the current hardware
- https://node-llama-cpp.withcat.ai/cli/inspect/measure
  - > `inspect measure` command
    > Measure VRAM consumption of a GGUF model file with all possible combinations of gpu layers and context sizes

@0xdevalias commented on GitHub (Mar 24, 2025):

Here's a debug patch I put together to get a bit more info about the various extra bits of info we could inspect:

```diff
diff --git a/src/plugins/local-llm-rename/llama.ts b/src/plugins/local-llm-rename/llama.ts
index 717fc76..fa16eb3 100644
--- a/src/plugins/local-llm-rename/llama.ts
+++ b/src/plugins/local-llm-rename/llama.ts
@@ -22,23 +22,95 @@ export async function llama(opts: {
   disableGpu?: boolean;
 }): Promise<Prompt> {
   const disableGpu = opts.disableGpu ?? IS_CI;
+
+  // Ref:
+  //   https://node-llama-cpp.withcat.ai/api/functions/getLlama
+  //   https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaOptions
+  //   https://node-llama-cpp.withcat.ai/api/classes/Llama
   const llama = await getLlama({ gpu: disableGpu ? false : "auto" });
+
+  console.log(
+    "Llama Debug Info:",
+    {
+      gpu: llama.gpu,
+      supportsGpuOffloading: llama.supportsGpuOffloading,
+      supportsMmap: llama.supportsMmap,
+      gpuSupportsMmap: llama.gpuSupportsMmap,
+      supportsMlock: llama.supportsMlock,
+      cpuMathCores: llama.cpuMathCores,
+      maxThreads: llama.maxThreads,
+      logLevel: llama.logLevel,
+      buildType: llama.buildType,
+      cmakeOptions: llama.cmakeOptions,
+      llamaCppRelease: llama.llamaCppRelease,
+      systemInfo: llama.systemInfo,
+      vramPaddingSize: llama.vramPaddingSize,
+      getVramState: await llama.getVramState(),
+      // getSwapState: await llama.getSwapState(),
+      getGpuDeviceNames: await llama.getGpuDeviceNames(),
+    }
+  );
+
+  // Ref:
+  //   https://node-llama-cpp.withcat.ai/api/classes/Llama#loadmodel
+  //   https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaModelOptions
+  //   https://node-llama-cpp.withcat.ai/api/classes/LlamaModel
   const modelOpts: LlamaModelOptions = {
     modelPath: getModelPath(opts?.model),
-    gpuLayers: disableGpu ? 0 : undefined
+    gpuLayers: disableGpu ? 0 : undefined,
+    onLoadProgress: (loadProgress) => {
+      const percent = (loadProgress * 100).toFixed(2);
+      console.log(`LlamaModel::onLoadProgress: ${percent}%`);
+    }
   };
   verbose.log("Loading model with options", modelOpts);
   const model = await llama.loadModel(modelOpts);
 
+  console.log(
+    "LlamaModel Debug Info:",
+    {
+      filename: model.filename,
+      fileInfo: model.fileInfo,
+      size: model.size,
+      flashAttentionSupported: model.flashAttentionSupported,
+      defaultContextFlashAttention: model.defaultContextFlashAttention,
+      trainContextSize: model.trainContextSize,
+      embeddingVectorSize: model.embeddingVectorSize,
+      vocabularyType: model.vocabularyType,
+      getWarnings: model.getWarnings(),
+    }
+  );
+
+  // Ref:
+  //   https://node-llama-cpp.withcat.ai/api/classes/LlamaModel#createcontext
+  //   https://node-llama-cpp.withcat.ai/api/classes/LlamaContext
   const context = await model.createContext({ seed: opts?.seed });
 
+  console.log(
+    "LlamaContext Debug Info:",
+    {
+      contextSize: context.contextSize,
+      batchSize: context.batchSize,
+      flashAttention: context.flashAttention,
+      stateSize: context.stateSize,
+      currentThreads: context.currentThreads,
+      idealThreads: context.idealThreads,
+      totalSequences: context.totalSequences,
+      sequencesLeft: context.sequencesLeft,
+    }
+  );
+
   return async (systemPrompt, userPrompt, responseGrammar) => {
+    // Ref:
+    //   https://node-llama-cpp.withcat.ai/api/classes/LlamaChatSession
+    //   https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaChatSessionOptions
     const session = new LlamaChatSession({
       contextSequence: context.getSequence(),
       autoDisposeSequence: true,
       systemPrompt,
       chatWrapper: getModelWrapper(opts.model)
     });
+
     const response = await session.promptWithMeta(userPrompt, {
       temperature: 0.8,
       grammar: new LlamaGrammar(llama, {
@@ -46,7 +118,9 @@ export async function llama(opts: {
       }),
       stopOnAbortSignal: true
     });
+
     session.dispose();
+
     return responseGrammar.parseResult(response.responseText);
   };
 }
```

And then I was running like this:

```shell
⇒ npm start -- local --verbose samples/foo.js 2>&1 | subl
```

But on my Intel Mac it doesn't register a GPU, so it's hard to debug much further:

```
Llama Debug Info: {
  gpu: false,
  supportsGpuOffloading: false,
  supportsMmap: true,
  gpuSupportsMmap: undefined,
  supportsMlock: true,
  cpuMathCores: undefined,
  maxThreads: undefined,
  logLevel: 'warn',
  buildType: 'prebuilt',
  cmakeOptions: {},
  llamaCppRelease: { repo: 'ggerganov/llama.cpp', release: 'b3543' },
  systemInfo: 'AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ',
  vramPaddingSize: 0,
  getVramState: { total: 0, used: 0, free: 0 },
  getGpuDeviceNames: []
}
```