[PR #21] [MERGED] Implement TrackExtractor for Spotify Track Data Extraction with Lyrics Support #101

Closed
opened 2026-03-13 23:02:32 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/AliAkhtari78/SpotifyScraper/pull/21
Author: @Copilot
Created: 5/21/2025
Status: Merged
Merged: 5/22/2025
Merged by: @AliAkhtari78

Base: master ← Head: copilot/fix-20


📝 Commits (4)

  • bf561da Initial plan for issue
  • 7321278 Add Token class and update constants for tests
  • 86245d0 Complete the implementation of track extraction with lyrics support
  • 202e2c5 Prioritize embed URLs for track extraction without authentication

📊 Changes

9 files changed (+928 additions, -57 deletions)


➕ direct_test.py (+156 -0)
📝 src/spotify_scraper/core/constants.py (+10 -0)
📝 src/spotify_scraper/core/exceptions.py (+22 -0)
📝 src/spotify_scraper/extractors/track.py (+18 -29)
📝 src/spotify_scraper/parsers/json_parser.py (+29 -0)
➕ test_embed_url.py (+65 -0)
➕ test_track_extraction.py (+62 -0)
📝 tests/unit/test_track_extractor.py (+52 -28)
➕ track_extractor.py (+514 -0)

📄 Description

This PR implements the TrackExtractor class, which extracts track data from Spotify web pages. It also adds lyrics extraction, including line-synchronized timing information.

🚀 Features Implemented

  • Track metadata extraction (name, ID, URI, duration, artists, album)
  • Preview URL and playability status extraction
  • Lyrics extraction with synchronized timing
  • Support for both regular and embed Spotify URLs
  • Robust error handling with fallbacks

🧠 Implementation Details

The implementation uses a multi-strategy approach to extract track data:

  1. First attempts to use embed URLs (more reliable, no authentication needed)
  2. Falls back to regular URLs if embed extraction fails
  3. Extracts data from Spotify's __NEXT_DATA__ script tag containing JSON data
  4. Handles various field formats and structures in the Spotify API response
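
The strategy above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names (`to_embed_url`, `parse_next_data`, `extract_page_json`), the URL patterns, and the injected `fetch` callable are all assumptions made for the example, and the shape of the JSON inside `__NEXT_DATA__` is deliberately left untouched.

```python
import json
import re
from typing import Callable, Optional

# Assumed URL shape: https://open.spotify.com/[embed/]track/<id>
TRACK_ID_RE = re.compile(r"open\.spotify\.com/(?:embed/)?track/([A-Za-z0-9]+)")
# Matches the body of the <script id="__NEXT_DATA__"> tag.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def track_id_from_url(url: str) -> str:
    """Pull the track ID out of a regular or embed track URL."""
    match = TRACK_ID_RE.search(url)
    if match is None:
        raise ValueError(f"not a Spotify track URL: {url}")
    return match.group(1)


def to_embed_url(url: str) -> str:
    """Rewrite any track URL to its embed form."""
    return f"https://open.spotify.com/embed/track/{track_id_from_url(url)}"


def to_regular_url(url: str) -> str:
    """Rewrite any track URL to its regular (non-embed) form."""
    return f"https://open.spotify.com/track/{track_id_from_url(url)}"


def parse_next_data(html: str) -> Optional[dict]:
    """Return the JSON from the __NEXT_DATA__ script tag, or None if absent or invalid."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def extract_page_json(url: str, fetch: Callable[[str], str]) -> dict:
    """Try the embed URL first, then fall back to the regular URL.

    `fetch` is a hypothetical callable that returns the page HTML for a URL;
    injecting it keeps the sketch independent of any HTTP library.
    """
    for candidate in (to_embed_url(url), to_regular_url(url)):
        try:
            html = fetch(candidate)
        except Exception:
            # Broad catch for brevity only; real code would catch specific errors.
            continue
        data = parse_next_data(html)
        if data is not None:
            return data
    raise LookupError(f"no __NEXT_DATA__ JSON found for {url}")
```

Passing `fetch` in as a parameter is a design choice for the sketch: it lets the fallback order be exercised with a stub instead of live requests.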

🧪 Testing

To validate the implementation, I:

  1. Added lyrics extraction to the extract_track_data function in json_parser.py
  2. Extended test fixtures to validate lyrics extraction
  3. Created standalone test scripts that verify the extraction works correctly

📋 Validation

The implementation was tested against real Spotify track data and successfully extracts:

  • Basic track information (ID, name, type)
  • Duration in milliseconds
  • Artist information
  • Album data with images
  • Preview URL for playback
  • Lyrics with line-synchronized timing data

📝 Example Output

{
  "id": "4u7EnebtmKWzUH433cf5Qv",
  "name": "Bohemian Rhapsody",
  "lyrics": {
    "sync_type": "LINE_SYNCED",
    "lines": [
      {
        "start_time_ms": 1000,
        "words": "Is this the real life?",
        "end_time_ms": 7000
      },
      // more lines...
    ],
    "provider": "SPOTIFY",
    "language": "en"
  }
}
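
As an illustration of how a consumer might use the line-synced structure above, the following sketch looks up the lyric line active at a given playback position. It assumes the `lines` list is sorted by `start_time_ms` and that `end_time_ms` may be missing; `line_at` is a hypothetical helper, not part of this PR, and the second sample line is a made-up placeholder.

```python
from bisect import bisect_right
from typing import Optional


def line_at(lyrics: dict, position_ms: int) -> Optional[str]:
    """Return the words of the line active at position_ms, or None.

    Assumes lyrics["lines"] is sorted by start_time_ms.
    """
    lines = lyrics.get("lines", [])
    starts = [line["start_time_ms"] for line in lines]
    # Index of the last line whose start time is <= position_ms.
    index = bisect_right(starts, position_ms) - 1
    if index < 0:
        return None  # before the first line starts
    line = lines[index]
    end = line.get("end_time_ms")
    if end is not None and position_ms >= end:
        return None  # in a gap after this line has ended
    return line["words"]


sample = {
    "sync_type": "LINE_SYNCED",
    "lines": [
        {"start_time_ms": 1000, "words": "Is this the real life?", "end_time_ms": 7000},
        # Placeholder entry with invented timing, for illustration only.
        {"start_time_ms": 7000, "words": "(next line)", "end_time_ms": 12000},
    ],
}

print(line_at(sample, 3500))
```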

Fixes #20.




🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
