[GH-ISSUE #1655] Should ArchiveBox be using the GPL2 due to mutagen dependency? #4005

Closed
opened 2026-03-15 01:16:12 +03:00 by kerem · 4 comments

Originally created by @erwin on GitHub (Feb 10, 2025).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1655

Originally assigned to: @pirate on GitHub.

Provide a screenshot and describe the bug

I noticed that you're using the GPL2 licensed "mutagen" to grab wav/flac/mp3 audio file info:

def get_audio_size(audiopath):
    extension = audiopath.rpartition(".")[-1].lower()
    if extension not in {"mp3", "wav", "flac"}:
        raise RuntimeError(f"The audio format {extension} is not supported, please convert the audio files to mp3, flac, or wav format!")

    audio_info = mutagen.File(audiopath).info
    return int(audio_info.length * audio_info.sample_rate)

Since Mutagen (https://github.com/quodlibet/mutagen) is GPL2 licensed, and this is runtime use of the GPL2 licensed code directly by your app, I believe that you are also obligated to use the GPL2 license.

I believe that MIT licensed tinytag (https://github.com/tinytag/tinytag) should provide similar functions, if you're interested in preserving your MIT license.

Hopefully that's helpful to you! If you need any help with that, let me know.

Steps to reproduce

Issue is related to licensing.

Logs or errors

No log steps are necessary.

ArchiveBox Version

v0.8.5

How did you install the version of ArchiveBox you are using?

Other

What operating system are you running on?

Linux (Ubuntu/Debian/Arch/Alpine/etc.)

What type of drive are you using to store your ArchiveBox data?

  • [x] some of data/ is on a local SSD or NVMe drive
  • [ ] some of data/ is on a spinning hard drive or external USB drive
  • [ ] some of data/ is on a network mount (e.g. NFS/SMB/Ceph/GlusterFS/etc.)
  • [ ] some of data/ is on a FUSE mount (e.g. SSHFS/RClone/S3/B2/Google Drive/Dropbox/etc.)

Docker Compose Configuration


ArchiveBox Configuration



@pirate commented on GitHub (Feb 10, 2025):

ok added to backlog.

I'd gladly take a PR if anyone wants to take a crack at it (o3 or Claude could probably swap it out easily).


@Intralexical commented on GitHub (Feb 10, 2025):

> You have a GPLed program that I'd like to link with my code to build a proprietary program. Does the fact that I link with your program mean I have to GPL my program? (https://www.gnu.org/licenses/gpl-faq.html#LinkingWithGPL)
>
> Not exactly. It means you must release your program under a license compatible with the GPL (more precisely, compatible with one or more GPL versions accepted by all the rest of the code in the combination that you link). The combination itself is then available under those GPL versions.

I think MIT should be fine.

But downstream proprietary versions that bundle Mutagen might not.

> What is the difference between an “aggregate” and other kinds of “modified versions”? (https://www.gnu.org/licenses/gpl-faq.html#MereAggregation)
>
> An “aggregate” consists of a number of separate programs, distributed together on the same CD-ROM or other media. The GPL permits you to create and distribute an aggregate, even when the licenses of the other software are nonfree or GPL-incompatible. The only condition is that you cannot release the aggregate under a license that prohibits users from exercising rights that each program's individual license would grant them.
>
> Where's the line between two separate programs, and one program with two parts? This is a legal question, which ultimately judges will decide. We believe that a proper criterion depends both on the mechanism of communication (exec, pipes, rpc, function calls within a shared address space, etc.) and the semantics of the communication (what kinds of information are interchanged).
>
> If the modules are included in the same executable file, they are definitely combined in one program. If modules are designed to run linked together in a shared address space, that almost surely means combining them into one program.
>
> By contrast, pipes, sockets and command-line arguments are communication mechanisms normally used between two separate programs. So when they are used for communication, the modules normally are separate programs. But if the semantics of the communication are intimate enough, exchanging complex internal data structures, that too could be a basis to consider the two parts as combined into a larger program.

You can also just cut and paste any code that interacts with Mutagen into its own module, and run it with subprocess.run() instead of import.
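
A rough sketch of what that subprocess split could look like, reusing the get_audio_size snippet quoted in the issue. The HELPER_SOURCE string and the helper_source parameter are hypothetical names made up for illustration; in practice the helper would more likely live in its own file. This only shows the mechanics (separate process, command-line argument in, small JSON result out) and is not a claim about where the legal line sits:

```python
import json
import subprocess
import sys

# Hypothetical helper: the only code that imports mutagen. It runs in its own
# interpreter process and reports two plain numbers as JSON on stdout.
HELPER_SOURCE = """
import json, sys
import mutagen
info = mutagen.File(sys.argv[1]).info
print(json.dumps({"length": info.length, "sample_rate": info.sample_rate}))
"""


def get_audio_size(audiopath, helper_source=HELPER_SOURCE):
    """Return the number of audio samples, computed via a separate process."""
    result = subprocess.run(
        [sys.executable, "-c", helper_source, audiopath],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    info = json.loads(result.stdout)
    return int(info["length"] * info["sample_rate"])
```

The calling program never imports mutagen itself; it only passes a path as a command-line argument and reads back a small JSON object.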


@Intralexical commented on GitHub (Feb 10, 2025):

def get_audio_size(audiopath):
    extension = audiopath.rpartition(".")[-1].lower()
    if extension not in {"mp3", "wav", "flac"}:
        raise RuntimeError(f"The audio format {extension} is not supported, please convert the audio files to mp3, flac, or wav format!")

    audio_info = mutagen.File(audiopath).info
    return int(audio_info.length * audio_info.sample_rate)

...Also, where is this code using Mutagen? GitHub and grep don't find it.

https://github.com/search?q=repo%3AArchiveBox%2FArchiveBox%20mutagen&type=code

Google says it's actually from a TTS program? Not ArchiveBox?

https://github.com/coqui-ai/TTS/blob/5dcc16d1931538e5bce7cb20c1986df371ee8cd6/TTS/tts/datasets/dataset.py#L47-L53


@pirate commented on GitHub (Feb 10, 2025):

Ahh yeah, after more digging, the only mutagen dependency we have is through yt-dlp. Closing because ArchiveBox doesn't directly use mutagen at all, and ArchiveBox definitely qualifies as an aggregate: all communication with extractors is done by spawning separate subprocesses, and we don't import extractor code directly.
