[GH-ISSUE #97] More deterministic renames across different versions of the same code #42

Open
opened 2026-03-03 13:52:27 +03:00 by kerem · 12 comments
Owner

Originally created by @neoOpus on GitHub (Sep 12, 2024).
Original GitHub issue: https://github.com/jehna/humanify/issues/97

Hi,

I have an idea that I hope will be helpful and prompt some discussion.

Currently, LLMs often guess variable names differently across various versions of the same JavaScript code. This inconsistency complicates versioning, tracking changes, and merging code for anyone regularly analyzing or modifying applications, extensions, etc.

My suggestion is to create a mapping file that lists generated variable names alongside their LLM-generated alternatives, updated continuously. This would serve as a lookup table for the LLM, helping maintain consistency and reducing variations in the final output. Admittedly, I haven't fully explored the feasibility of this concept, but I believe it would strengthen reverse-engineering processes.


@0xdevalias commented on GitHub (Sep 13, 2024):

My suggestion is to create a mapping file that lists generated variable names alongside their LLM-generated alternatives

@neoOpus This is similar to an area I have spent a fair bit of time thinking about/prototyping tooling around in the past. One of the bigger issues that you're likely to find here is that with bundlers like webpack/etc, when they minimise the variable names, they won't necessarily choose the same minified variable name for the same code each time. So to make a 'lookup table' type concept work, you first need to be able to stabilise the 'reference key' for each of those variables, even if the bundler chose something different to represent it.

You can find some of my initial hacky prototypes scattered in this repo:

  • https://github.com/0xdevalias/poc-ast-tools

My thoughts/notes on this are scattered around a few places, but these may be some useful/interesting places to start:

  • https://github.com/0xdevalias/chatgpt-source-watch/issues/3
  • https://github.com/Wilfred/difftastic/issues/631
  • https://github.com/afnanenayet/diffsitter/issues/819
  • https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser.js
  • https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser-poc-acorn.js
  • https://github.com/0xdevalias/chatgpt-source-watch/issues/10
  • https://github.com/pionxzh/wakaru/issues/34
  • https://github.com/pionxzh/wakaru/issues/74
  • https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#variable-name-mangling
  • https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#my-chatgpt-research--conversations
  • https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#fingerprinting-minified-javascript-libraries
  • https://gist.github.com/0xdevalias/31c6574891db3e36f15069b859065267#fingerprinting-minified-javascript-libraries--ast-fingerprinting--source-code-similarity--etc
  • https://github.com/pionxzh/wakaru/issues/73
  • https://github.com/pionxzh/wakaru/issues/41
  • https://github.com/j4k0xb/webcrack/issues/21

You can see an example of a larger scale project where I was trying to stabilise the minified variable names to reduce the 'noise' in large scale source diffing here:

  • https://github.com/0xdevalias/chatgpt-source-watch

(Edit: I have captured my notes from this comment on the following gist for posterity: https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#issue-97-more-deterministic-renames-across-different-versions-of-the-same-code)

@neoOpus commented on GitHub (Sep 14, 2024):

Thank you, Glenn, for taking the time to answer me in detail...

I had written a message before this one, but it got lost...

I am going through the links you just shared, and I will get back to you with some ideas. I think I already have some that are worth discussing, but I want to make sure first that they are valid and viable ones, as my knowledge is still very limited in this area.


@jehna commented on GitHub (Sep 17, 2024):

Currently, LLMs often guess variable names differently across various versions of the same JavaScript code. This inconsistency complicates versioning, tracking changes, and merging code for anyone regularly analyzing or modifying applications, extensions, etc.

Just to clarify that I'm on the same page here, is the issue that:

  • You have multiple versions of a webapp/website that change over time
  • You un-minify all of them
  • You need to compare their differences, and it's proving difficult as Humanify does not generate the same names for the same minified code

This is an interesting problem. I'd love to research some ways to implement this. Especially AST fingerprinting seems promising, thank you @0xdevalias for your links.


@jehna commented on GitHub (Sep 17, 2024):

One issue related to fingerprinting is that most of the stuff in a modern webapp bundle is dependencies. And most of the dependencies probably have public source code. So in theory it would be possible to build a huge database of open source code fingerprints that would match a specific version of a specific code, and to have a tool that deterministically reverses the code to its actual original source.

In theory we could use a similar method to build a local database of already-humanified code, which would make the reverse process more deterministic on subsequent runs.


@neoOpus commented on GitHub (Sep 18, 2024):

I would like to share an idea I’ve been considering, even though I’m still in the process of researching this topic. I hope it proves to be useful!

My suggestion is to break the code down into smaller, modular functions, which seems to be a practice your script might already be implementing. One approach to enhance this is to replace all variable names with generic placeholders (like a, b, c, d) or numerical identifiers (such as 0001, 0002, 0003) in order of appearance. (I honestly don't know how this can be done, but maybe via RegEx or just asking the LLM to do it.)

Anyway, this would allow for a standardized, minified version of the code. After creating this stripped-down and abstracted version, we could calculate a hash of the code as a string. This hash would serve as a unique identifier to track changed portions of the code across different versions of the project and prevent duplicate entries, as well as a reference to where the future generated variable names are stored. The resulting data could be stored in an appropriate format, such as CSV, NoSQL, or JSON, based on your requirements for speed, scalability, and ease of access.

Next, we could analyze this stored data from a designated project location or maybe a specified subfolder (e.g. .humanifjs). Here, we could leverage language models (LLMs) to generate meaningful variable names based on the context of the functions. This would create a "reference" that can assist in future analyses of the code.

When new versions of the obfuscated code are generated (which will have different variable names), we can apply a similar process to compare them with previously processed versions. By using diff techniques, we can identify changes and maintain a collection of these sub-chunks of code, which would help reduce discrepancies. In most cases, we should see a high degree of similarity unless a particular function's logic has changed. We can then reassign the previously generated variable names (instead of the original variable names, or having to generate different ones) to the new code chunks, either by feeding them as choices to the LLM or by assigning them directly programmatically, to reduce the need to consume more tokens for the same chunks.

Additionally, to enhance this process, we could explore various optimizations in how the LLM generates and assigns these variable names, as well as how we handle the storage and retrieval of the chunks.

I look forward to your thoughts on this approach and any suggestions you may have for improving it further!

What would make this work better is to take advantage of diff (compare) techniques to create some sort of sub-chunks, then keep them available to reduce the discrepancy, and maybe also optimize the generation... I hope this makes sense.

And as you stated here

One issue related to fingerprinting is that most of the stuff in a modern webapp bundle is dependencies. And most of the dependencies probably have public source code. So in theory it would be possible to build a huge database of open source code fingerprints that would match a specific version of a specific code, and to have a tool that deterministically reverses the code to its actual original source.

In theory we could use a similar method to build a local database of already-humanified code, which would make the reverse process more deterministic on subsequent runs.

This would be optimal indeed, as it would allow leveraging the collective work to get the best results.

PS: I don't have a good machine right now to do some testing myself, nor an API key that allows me to do them properly.


@0xdevalias commented on GitHub (Sep 25, 2024):

One issue related to fingerprinting is that most of the stuff in a modern webapp bundle is dependencies. And most of the dependencies probably have public source code. So in theory it would be possible to build a huge database of open source code fingerprints that would match a specific version of a specific code, and to have a tool that deterministically reverses the code to its actual original source.

@jehna Agreed. This was one of the ideas that first led me down the 'fingerprinting' path. Though instead of 'deterministically reversing the code to the original source' in its entirety (which may also be useful), my plan was first to be able to detect dependencies and mark them as such (as most of the time I don't care to look too deeply at them), and then secondly to just be able to extract the 'canonical variable/function names' from that original source and be able to apply them to my unminified version (similar to how humanify currently uses AI for this step); as that way I know that even if there is some little difference in the actual included code, I won't lose that by replacing it with the original source. These issues on wakaru are largely based on this area of things:

While it's a very minimal/naive attempt, and definitely not the most robust way to approach things, a while back I implemented a really basic 'file fingerprint' method, mostly to assist in figuring out when a chunk had been renamed (but was otherwise largely the same chunk as before), that I just pushed to poc-ast-tools (github.com/0xdevalias/poc-ast-tools@b0ef60f860):

When I was implementing it, I was thinking about embeddings, but didn't want to have to send large files to the OpenAI embeddings API; and wanted a quick/simple local approximation of it.

Expanding on this concept to the more general code fingerprinting problem; I would probably look at breaking things down to at least an individual module level, as I believe usually modules tend to coincide with original source files; and maybe even break things down even further to a function level if needed. I would also probably be normalising the code to remove any function/variable identifiers first; and to remove the impact of whitespace differences/etc.

While it's not applied to generating a fingerprint, you can see how I've used some of these techniques in my approach to creating a 'diff minimiser' for identifying newly changed code between builds, while ignoring the 'minification noise / churn':


In theory we could use a similar method to build a local database of already-humanified code, which would make the reverse process more deterministic on subsequent runs.

@jehna Oh true.. yeah, that definitely makes sense. Kind of like a local cache.


One approach to enhance this is to replace all variable names with generic placeholders (like a, b, c, d) or numerical identifiers (such as 0001, 0002, 0003) by order of apparency. (I honestly don't know how this can be done but maybe via RegEx or just asking LLM to do it).

@neoOpus This would be handled by parsing the code into an AST, and then manipulating that AST to rename the variables.

You can see various hacky PoC versions of this with various parsers in my poc-ast-tools repo (I don't remember which is the best/most canonical as I haven't looked at it all for ages), eg:

Which you can see some of the early hacky mapping attempts I was making in these files:

That was the point where I realised I really needed something more robust (such as a proper fingerprint that would survive code minification) to use as the key.


We can then reassign the previously generated variable names (instead of the original variable names or having to generate different ones) to the new code chunks by feeding them as choices for the LLM or assigning them directly programmatically to reduce the need to consume more tokens for the same chunks.

@neoOpus Re-applying the old variable names to the new code wouldn't need an LLM at all, as that part is handled in the AST processing code within humanify:

  • https://thejunkland.com/blog/using-llms-to-reverse-javascript-minification#:~:text=Don%27t%20let%20AI%20touch%20the%20code
    • Don't let AI touch the code
      Now while LLMs are very good at rephrasing and summarizing, they are not very good at coding (yet). They have inherent randomness, which makes them unsuitable for performing the actual renaming and modification of the code.

      Fortunately renaming a Javascript variable within its scope is a solved problem with traditional tools like Babel. Babel first parses the code into an abstract syntax tree (AST, a machine representation of the code), which is easy to modify using well behaving algorithms.

      This is much better than letting the LLM modify the code on a text level; it ensures that only very specific transformations are carried out so the code's functionality does not change after the renaming. The code is guaranteed to have the original functionality and to be runnable by the computer.


I would like to share an idea I’ve been considering, even though I’m still in the process of researching this topic. I hope it proves to be useful!

@neoOpus At a high level, it seems that the thinking/aspects you've outlined here are more or less in line with what I've discussed previously in the resources I linked to in my first comment above.


PS: I don't have a good machine right now to do some testing myself, nor an API key that allows me to do them properly.

@neoOpus IMO, the bulk of the 'harder parts' of implementing this aren't really LLM related, and shouldn't require a powerful machine. The areas I would suggest most looking into around this are how AST parsing/manipulation works; and then how to create a robust/stable fingerprinting method.

IMO, figuring the ideal method of fingerprinting is probably the largest / potentially hardest 'unknown' in all of this currently (at least to me, since while I started to gather resources for it, I haven't had the time to deep dive into reading/analysing them all):

Off the top of my head, I would probably look at breaking things down to at least an individual module level, as I believe usually modules tend to coincide with original source files; and maybe even break things down even further to a function level if needed; and then generate fingerprints for them.

I would also potentially consider looking at the module/function 'entry/exit' points (eg. imports/exports); or maybe even the entire 'shape' of the module import graph itself.

I would also probably be normalising the code to remove any function/variable identifiers and to remove the impact of whitespace differences/etc; before generating any fingerprints on it.

Another potential method I considered for the fingerprints is identifying the types of elements that tend to remain stable even when minified, and using those as part of the fingerprint. As that is one of the manual methods I used to be able to identify a number of the modules listed here:


(Edit: I have captured my notes from this comment on the following gist for posterity: https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#issue-97-more-deterministic-renames-across-different-versions-of-the-same-code)

<!-- gh-comment-id:2372638981 --> @0xdevalias commented on GitHub (Sep 25, 2024): > One issue related to fingerprinting is that most of the stuff in a modern webapp bundle is dependencies. And most of the dependencies probably have public source code. So in theory it would be possible to build a huge database of open source code fingerprints that would match a specific version of a specific code, and to have a tool that deterministically reverses the code to its actual original source. @jehna Agreed. This was one of the ideas that first led me down the 'fingerprinting' path. Though instead of 'deterministically reversing the code to the original source' in its entirety (which may also be useful), my plan was first to be able to detect dependencies and mark them as such (as most of the time I don't care to look too deeply at them), and then secondly to just be able to extract the 'canonical variable/function names' from that original source and be able to apply them to my unminified version (similar to how `humanify` currently uses AI for this step); as that way I know that even if there is some little difference in the actual included code, I won't lose that by replacing it with the original source. 
These issues on `wakaru` are largely based on this area of things: - https://github.com/pionxzh/wakaru/issues/41 - https://github.com/pionxzh/wakaru/issues/73 - https://github.com/pionxzh/wakaru/issues/74 While it's a very minimal/naive attempt, and definitely not the most robust way to approach things, a while back I implemented a really basic 'file fingerprint' method, mostly to assist in figuring out when a chunk had been renamed (but was otherwise largely the same chunk as before), that I just pushed to `poc-ast-tools` (https://github.com/0xdevalias/poc-ast-tools/commit/b0ef60f8608385c40de2644b3346b1834eb477a0): - https://github.com/0xdevalias/poc-ast-tools/blob/main/text_similarity_checker.py - https://github.com/0xdevalias/poc-ast-tools/blob/main/rename-chunk.sh When I was implementing it, I was thinking about embeddings, but didn't want to have to send large files to the OpenAI embeddings API; and wanted a quick/simple local approximation of it. Expanding on this concept to the more general code fingerprinting problem; I would probably look at breaking things down to at least an individual module level, as I believe usually modules tend to coincide with original source files; and maybe even break things down even further to a function level if needed. I would also probably be normalising the code to remove any function/variable identifiers first; and to remove the impact of whitespace differences/etc. 
While it's not applied to generating a fingerprint, you can see how I've used some of these techniques in my approach to creating a 'diff minimiser' for identifying newly changed code between builds, while ignoring the 'minification noise / churn': - https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser.js - https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser-poc-acorn.js --- > In theory we could use a similar method to build a local database of already-humanified code, which would make the reverse process more deterministic on subsequent runs. @jehna Oh true.. yeah, that definitely makes sense. Kind of like a local cache. --- > One approach to enhance this is to replace all variable names with generic placeholders (like a, b, c, d) or numerical identifiers (such as 0001, 0002, 0003) by order of apparency. (I honestly don't know how this can be done but maybe via RegEx or just asking LLM to do it). @neoOpus This would be handled by parsing the code into an AST, and then manipulating that AST to rename the variables. 
You can see various hacky PoC versions of this with various parsers in my `poc-ast-tools` repo (I don't remember which is the best/most canonical as I haven't looked at it all for ages), eg: - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1.js - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1_0_old_combined.js - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1_1.js - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1_2.js - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1_3.js - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1_3_clean.js - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1_3_cli.js - etc: https://github.com/0xdevalias/poc-ast-tools Which you can see some of the early hacky mapping attempts I was making in these files: - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.167-121de668c4456907-HEAD.json - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.167-HEAD-rewritten.json - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.167-HEAD.json - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.167-HEAD%5E1.json - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.167-f9af0280d3150ee2-HEAD.json - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.167-test.json - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.json That was the point where I realised I really needed something more robust (such as a proper fingerprint that would survive code minification) to use as the key. --- > We can then reassign the previously generated variable names (instead of the original variable names or having to generate different ones) to the new code chunks by feeding them as choices for the LLM or assigning them directly programmatically to reduce the need to consume more tokens for the same chunks. 
@neoOpus Re-applying the old variable names to the new code wouldn't need an LLM at all, as that part is handled in the AST processing code within `humanify`: - https://thejunkland.com/blog/using-llms-to-reverse-javascript-minification#:~:text=Don%27t%20let%20AI%20touch%20the%20code - > Don't let AI touch the code > Now while LLMs are very good at rephrasing and summarizing, they are not very good at coding (yet). They have inherent randomness, which makes them unsuitable for performing the actual renaming and modification of the code. > > Fortunately renaming a Javascript variable within its scope is a solved problem with traditional tools like Babel. Babel first parses the code into an abstract syntax tree (AST, a machine representation of the code), which is easy to modify using well behaving algorithms. > > This is much better than letting the LLM modify the code on a text level; it ensures that only very specific transformations are carried out so the code's functionality does not change after the renaming. The code is guaranteed to have the original functionality and to be runnable by the computer. --- > I would like to share an idea I’ve been considering, even though I’m still in the process of researching this topic. I hope it proves to be useful! @neoOpus At a high level, it seems that the thinking/aspects you've outlined here are more or less in line with what I've discussed previously in the resources I linked to [in my first comment above](https://github.com/jehna/humanify/issues/97#issuecomment-2347878686). --- > PS: I don't have a good machine right now to do some testing myself, nor an API key that allows me to do them properly. @neoOpus IMO, the bulk of the 'harder parts' of implementing this aren't really LLM related, and shouldn't require a powerful machine. The areas I would suggest most looking into around this are how AST parsing/manipulation works; and then how to create a robust/stable fingerprinting method. 
IMO, figuring the ideal method of fingerprinting is probably the largest / potentially hardest 'unknown' in all of this currently (at least to me, since while I started to gather resources for it, I haven't had the time to deep dive into reading/analysing them all): - https://gist.github.com/0xdevalias/31c6574891db3e36f15069b859065267#fingerprinting-minified-javascript-libraries--ast-fingerprinting--source-code-similarity--etc - https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#fingerprinting-minified-javascript-libraries Off the top of my head, I would probably look at breaking things down to at least an individual module level, as I believe usually modules tend to coincide with original source files; and maybe even break things down even further to a function level if needed; and then generate fingerprints for them. I would also potentially consider looking at the module/function 'entry/exit' points (eg. imports/exports); or maybe even the entire 'shape' of the module import graph itself. I would also probably be normalising the code to remove any function/variable identifiers and to remove the impact of whitespace differences/etc; before generating any fingerprints on it. Another potential method I considered for the fingerprints is identifying the types of elements that tend to remain stable even when minified, and using those as part of the fingerprint. 
As that is one of the manual methods I used to be able to identify a number of the modules listed here:

- https://github.com/pionxzh/wakaru/issues/41
- https://github.com/pionxzh/wakaru/issues/40
- https://github.com/pionxzh/wakaru/issues/79
- https://github.com/pionxzh/wakaru/issues/88
- https://github.com/pionxzh/wakaru/issues/89
- https://github.com/pionxzh/wakaru/issues/87
- etc: https://github.com/pionxzh/wakaru/issues?q=%22%5Bmodule-detection%5D%22

---

_(**Edit:** I have captured my notes from this comment on the following gist for posterity: https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#issue-97-more-deterministic-renames-across-different-versions-of-the-same-code)_
Author
Owner

@0xdevalias commented on GitHub (Oct 21, 2024):

Resume-ability would also be a good thing to consider.

Some of the discussion in the following issue could tangentially relate to resumability (specifically if a consistent 'map' of renames was created, perhaps that could also show which sections of the code hadn't yet been processed):

  • https://github.com/jehna/humanify/issues/97

Originally posted by @0xdevalias in https://github.com/jehna/humanify/issues/167#issuecomment-2425538385
Author
Owner

@0xdevalias commented on GitHub (Mar 10, 2025):

FYI, there is some recent upstream discussion happening in webcrack around stable identifier renaming:

  • https://github.com/j4k0xb/webcrack/issues/154
    • An idea I had was hashing various attributes of a variable like:

      • the initialization value
      • count usages
      • general location (which function it's in)

      With the example from above:

      // input
      var a = 100, b = 500, c = 1000;
      // output
      var v100_0_g = 100;
      var v500_0_g = 500;
      var v1000_0_g = 1000;
      
      // input
      var a = 1, b = 100, c = 500, d = 1000;
      // output
      var v1_0_g = 1; // only changed line!
      var v100_0_g = 100;
      var v500_0_g = 500;
      var v1000_0_g = 1000;
      

      Where the format is v${initialValue}_${usages}_${scope} (scope = "g"lobal). Of course this is a very naive example, real world would probably involve a hash.

      Originally posted by @Le0Developer in https://github.com/j4k0xb/webcrack/issues/154#issue-2895194646

As well as some related PRs:

  • https://github.com/j4k0xb/webcrack/pull/155
Author
Owner

@Le0Developer commented on GitHub (Mar 10, 2025):

Originally posted by @Le0Developer in "stable" identifier demangling j4k0xb/webcrack#154 (comment)

Can you please not mention me everywhere?

Author
Owner

@0xdevalias commented on GitHub (Mar 10, 2025):

Can you please not mention me everywhere?

I was using GitHub's 'reference in another issue', which includes it; wanted to give credit to the source of the information I was referring to; and also figured that the information in this thread may have been relevant/helpful for the issue you raised in webcrack.

But sure, I will try to keep that in mind for future.

Author
Owner

@neoOpus commented on GitHub (Mar 12, 2025):

@0xdevalias Thank you for keeping me updated on this...

I still have problems with the rate limitations... so I haven't done anything useful with this yet... But I have discussed several ideas, issues, etc. regarding this with some LLMs, and I think we can get something more robust.

I hope soon I will be able to work on some of these ideas.

Author
Owner

@0xdevalias commented on GitHub (Mar 12, 2025):

But I have discussed several ideas, issues, etc. regarding this with some LLMs, and I think we can get something more robust.

@neoOpus Awesome! I'd definitely be interested to hear what your thoughts/ideas/issues/etc are related to this if you're open to sharing them / when you get a chance.

I hope soon I will be able to work on some of these ideas.

@neoOpus And/or I look forward to seeing what you come up with when you do get to work on them!
