[GH-ISSUE #53] [Bug Report] Incorrect Podcast File Suffix and Not Detecting Existing Files #44

Closed
opened 2026-02-27 04:57:15 +03:00 by kerem · 8 comments

Originally created by @Ragnaran on GitHub (Aug 18, 2025).
Original GitHub issue: https://github.com/Googolplexed0/zotify/issues/53

Originally assigned to: @Googolplexed0 on GitHub.

Downloading podcasts fails. The ffmpeg stream identification routine appends `\n STREAM` to the resulting codec name, so the `EXT_MAP` lookup fails and the file is not found.

This PR solves it: https://github.com/Googolplexed0/zotify/pull/52

Note that existing files are also not being checked correctly, so the podcast is downloaded again every time, even if it already exists. This is a slightly different issue that I haven't had a chance to investigate yet.

(I appreciate the effort to actually identify the codec! I'd bet a steak dinner that all podcasts are encoded in vorbis, but who knows if that will ever change?)
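
To illustrate the suffix failure described above, here is a minimal sketch. It assumes the lookup is a plain dictionary keyed by codec name; the `EXT_MAP` contents, `FALLBACK_EXT`, and `ext_for_codec` below are hypothetical stand-ins, not zotify's actual code or the change made in the linked PR.

```python
# Hypothetical stand-in for zotify's EXT_MAP; the real mapping may differ.
EXT_MAP = {"vorbis": "ogg", "mp3": "mp3", "aac": "m4a"}
FALLBACK_EXT = "mp3"  # assumed fallback suffix for unknown codecs


def ext_for_codec(raw_codec: str) -> str:
    """Map a codec name reported by ffmpeg to a file extension."""
    # Keep only the first whitespace-delimited token, so trailing output
    # such as "\n STREAM" cannot break the dictionary lookup.
    tokens = raw_codec.split()
    codec = tokens[0].lower() if tokens else ""
    return EXT_MAP.get(codec, FALLBACK_EXT)


# The raw string is not a key in the map, but the cleaned token is.
print(EXT_MAP.get("vorbis\n STREAM"))
print(ext_for_codec("vorbis\n STREAM"))
```

Normalizing the string before the lookup keeps the map itself small, instead of adding a key for every variant of the probe output.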

kerem closed this issue and added the bug label (2026-02-27 04:57:15 +03:00)

@Ragnaran commented on GitHub (Aug 18, 2025):

This is about as far as I got with a more effective refactor, but it's not quite working yet. I'll take another stab tomorrow:

+++ b/zotify/podcast.py
@@ -86,7 +86,7 @@ def download_episode(episode_id, pbar_stack: list | None = None) -> None:
 
     with Loader(PrintChannel.PROGRESS_INFO, "Preparing download..."):
         filename = f"{podcast_name} - {episode_name}"
-        episode_path = PurePath(Zotify.CONFIG.get_root_podcast_path()) / podcast_name / f"{filename}.tmp"
+        episode_path = PurePath(Zotify.CONFIG.get_root_podcast_path()) / podcast_name / f"{filename}"
         create_download_directory(episode_path.parent)
 
         (raw, resp) = Zotify.invoke_url(PARTNER_URL + episode_id + '"}&extensions=' + PERSISTED_QUERY)
@@ -102,11 +102,17 @@ def download_episode(episode_id, pbar_stack: list | None = None) -> None:
                 wait_between_downloads(); return
 
             total_size: int = stream.input_stream.size
-            episode_path_exists = Path(episode_path).is_file() and Path(episode_path).stat().st_size == total_size
-            if episode_path_exists and Zotify.CONFIG.get_skip_existing():
+            episode_exists_on_filesystem = False
+            for extension_list_item in set(EXT_MAP.values()):
+                test_episode_path = Path(episode_path).with_suffix("." + extension_list_item)
+                Printer.debug(f"checking test_episode_path: {test_episode_path}")
+                if Path(test_episode_path).is_file() and Path(test_episode_path).stat().st_size == total_size and Zotify.CONFIG.get_skip_existing():
                     Printer.hashtaged(PrintChannel.SKIPPING, f'"{podcast_name} - {episode_name}" (EPISODE ALREADY EXISTS)')
+                    episode_exists_on_filesystem = True
+                    return
+            if episode_exists_on_filesystem == True:
                 wait_between_downloads(); return
-            
+            episode_path = Path(episode_path).with_suffix(".tmp")
             time_start = time.time()
             downloaded = 0
             pos, pbar_stack = Printer.pbar_position_handler(1, pbar_stack)

@Googolplexed0 commented on GitHub (Aug 19, 2025):

> (I appreciate the effort to actually identify the codec! I'd bet a steak dinner that all podcasts are encoded in vorbis, but who knows if that will ever change?)

Thanks, I try to make this as robust as possible. Surprisingly, many podcasts are hosted externally and encoded in mp3. That is why `.mp3` is the final fallback for suffixes.


@Googolplexed0 commented on GitHub (Aug 19, 2025):

Both issues should now be fixed. Thanks for the bug find and fix!


@Ragnaran commented on GitHub (Aug 20, 2025):

Sadly, duplicates are still not being detected. The logic `episode_path_exists = Path(episode_path).is_file()` checks the path with the `.tmp` file name, plus the file size. A successfully downloaded file gets renamed to its real extension, so it no longer has a `.tmp` extension and the duplicate check can never succeed. That's why I tried iterating over files that might have an extension from the `EXT_MAP`.

I'll see if I can fix it.
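
A small, self-contained reproduction of that reasoning, using a temporary directory rather than zotify itself. The file names and `simulate_rerun_check` are made up for illustration; the point is only that a check against the `.tmp` name cannot see a file that was already renamed.

```python
import tempfile
from pathlib import Path


def simulate_rerun_check() -> tuple[bool, bool]:
    """Return (tmp_path_found, final_path_found) after a finished download."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir) / "Show - Episode.tmp"
        tmp_path.write_bytes(b"audio")            # download in progress
        final_path = tmp_path.with_suffix(".ogg")
        tmp_path.rename(final_path)               # download finished, file moved
        # A later run that still tests the .tmp name will miss the file.
        return tmp_path.is_file(), final_path.is_file()


tmp_seen, final_seen = simulate_rerun_check()
print(f".tmp path found: {tmp_seen}, final path found: {final_seen}")
```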


@Googolplexed0 commented on GitHub (Aug 20, 2025):

> The logic `episode_path_exists = Path(episode_path).is_file()` is using the path + ".tmp" file name + the file size

My implemented fix (b8fd011) replaced the `.is_file()` with `.glob()`. If you are still seeing `.is_file()`, you may need to update to >= v0.9.23.
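
For readers following along, a glob-based existence check could look roughly like the sketch below. This is not the code from b8fd011; `find_existing` and its arguments are hypothetical, and skipping `.tmp` files is an assumption made for the example.

```python
import glob
import tempfile
from pathlib import Path


def find_existing(directory: Path, stem: str) -> list[Path]:
    """Return finished files named `stem` with any suffix except .tmp."""
    # Escape the stem so characters such as "[" in an episode title are
    # matched literally; only the suffix is wildcarded.
    pattern = glob.escape(stem) + ".*"
    return sorted(
        p for p in directory.glob(pattern)
        if p.is_file() and p.suffix != ".tmp"
    )


with tempfile.TemporaryDirectory() as demo_dir:
    folder = Path(demo_dir)
    (folder / "Show - Episode.ogg").write_bytes(b"audio")
    (folder / "Show - Episode.tmp").write_bytes(b"partial")
    print([p.name for p in find_existing(folder, "Show - Episode")])
```

Wildcarding only the suffix means the check does not depend on which extension the codec lookup picked for the finished file.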


@Ragnaran commented on GitHub (Aug 20, 2025):

I nailed it down. I believe `stream.input_stream.size` might be reported in 1024-byte increments, while on-disk file sizes might not reach that. The `Path(episode_path)` calls might also have been mixed up, resulting in checks against `Path(PurePath(episode_path))` invocations; I couldn't be 100% sure. I've added a check that allows the on-disk file size to be up to 1024 bytes smaller, but it only considers files that have a valid extension derived from the `EXT_MAP`; temp downloads are ignored.

The problem with `.glob()` (in my opinion) is that it wouldn't account for multiple matching files, or files that got copied/renamed.

Anyhoo, I whipped up PR https://github.com/Googolplexed0/zotify/pull/59 and tested it successfully.
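
The idea, as a rough sketch rather than the exact code in the PR: `EXT_MAP` here is a hypothetical stand-in, and `episode_already_downloaded` and `SIZE_TOLERANCE` are names invented for this example.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for zotify's EXT_MAP; the real mapping may differ.
EXT_MAP = {"vorbis": "ogg", "mp3": "mp3", "aac": "m4a"}
SIZE_TOLERANCE = 1024  # bytes the on-disk file may fall short of the reported size


def episode_already_downloaded(base: Path, expected_size: int,
                               tolerance: int = SIZE_TOLERANCE) -> bool:
    """Check `base` plus each known extension for a file of (nearly) the right size."""
    for ext in sorted(set(EXT_MAP.values())):
        candidate = base.with_name(f"{base.name}.{ext}")
        if not candidate.is_file():
            continue
        shortfall = expected_size - candidate.stat().st_size
        # Accept files that are exact or up to `tolerance` bytes smaller.
        if 0 <= shortfall <= tolerance:
            return True
    return False  # .tmp files are never candidates, so partial downloads are ignored


with tempfile.TemporaryDirectory() as demo_dir:
    base = Path(demo_dir) / "Show - Episode"
    base.with_name(base.name + ".ogg").write_bytes(b"\0" * 3000)
    print(episode_already_downloaded(base, expected_size=4000))
    print(episode_already_downloaded(base, expected_size=5000))
```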


@Googolplexed0 commented on GitHub (Aug 21, 2025):

> The problem with .glob() (in my opinion) is that it wouldn't account for multiple matching files, or files that got copied/renamed.

This isn't an issue with `.glob()`? There is no way to implement duplicate detection based on a filename pattern that would account for files that have been renamed. This would require checking against something other than filenames by definition. Also not sure what you mean by multiple matching files. The only part that is wildcarded in the `.glob()` is the file suffix, which accounts for the error cases where the file suffix exists outside of the `EXT_MAP`.

> I've added a check to allow the on-disk filesize to be up to 1024 bytes smaller

I do like this idea overall though. Will implement something similar.


@Ragnaran commented on GitHub (Aug 21, 2025):

> There is no way to implement duplicate detection based on a filename pattern that would account for files that have been renamed

Agreed. I figured "find the first possible file with a valid extension, check if the size is (close to) accurate, and if none exist, run the download." Anyhoo, thanks!
