[PR #21] [MERGED] Implement TrackExtractor for Spotify Track Data Extraction with Lyrics Support #101

Closed

opened 2026-03-13 23:02:32 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/AliAkhtari78/SpotifyScraper/pull/21
Author: @Copilot
Created: 5/21/2025
Status: ✅ Merged
Merged: 5/22/2025
Merged by: @AliAkhtari78

Base: master ← Head: copilot/fix-20

📝 Commits (4)

bf561da Initial plan for issue
7321278 Add Token class and update constants for tests
86245d0 Complete the implementation of track extraction with lyrics support
202e2c5 Prioritize embed URLs for track extraction without authentication

📊 Changes

9 files changed (+928 additions, -57 deletions)

➕ direct_test.py (+156 -0)
📝 src/spotify_scraper/core/constants.py (+10 -0)
📝 src/spotify_scraper/core/exceptions.py (+22 -0)
📝 src/spotify_scraper/extractors/track.py (+18 -29)
📝 src/spotify_scraper/parsers/json_parser.py (+29 -0)
➕ test_embed_url.py (+65 -0)
➕ test_track_extraction.py (+62 -0)
📝 tests/unit/test_track_extractor.py (+52 -28)
➕ track_extractor.py (+514 -0)

📄 Description

This PR implements the TrackExtractor class, which extracts track data from Spotify web pages. It also adds extraction of lyrics with line-synchronized timing information.

🚀 Features Implemented

✅ Track metadata extraction (name, ID, URI, duration, artists, album)
✅ Preview URL and playability status extraction
✅ Lyrics extraction with synchronized timing
✅ Support for both regular and embed Spotify URLs
✅ Robust error handling with fallbacks

🧠 Implementation Details

The implementation uses a multi-strategy approach to extract track data:

1. First attempts to use embed URLs (more reliable, no authentication needed)
2. Falls back to regular URLs if embed extraction fails
3. Extracts data from Spotify's __NEXT_DATA__ script tag containing JSON data
4. Handles various field formats and structures in the Spotify API response
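
The strategy above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the function names (`to_embed_url`, `extract_next_data`, `extract_with_fallback`), the regular expressions, and the injected `fetch` callable are all assumptions made for the example. Only the embed-first ordering and the `__NEXT_DATA__` script tag come from the description.

```python
import json
import re
from typing import Callable, Optional

# Hypothetical patterns for illustration; the PR's parsing logic may differ.
TRACK_ID_RE = re.compile(r"open\.spotify\.com/(?:embed/)?track/([A-Za-z0-9]+)")
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def to_embed_url(url: str) -> str:
    """Rewrite a regular or embed track URL to its embed form."""
    match = TRACK_ID_RE.search(url)
    if match is None:
        raise ValueError(f"not a Spotify track URL: {url}")
    return f"https://open.spotify.com/embed/track/{match.group(1)}"


def extract_next_data(html: str) -> Optional[dict]:
    """Return the JSON payload of the __NEXT_DATA__ script tag, if present."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def extract_with_fallback(url: str, fetch: Callable[[str], str]) -> dict:
    """Try the embed page first, then fall back to the URL as given."""
    candidates = [to_embed_url(url)]
    if url not in candidates:
        candidates.append(url)
    for candidate in candidates:
        try:
            data = extract_next_data(fetch(candidate))
        except OSError:
            data = None  # network failure: move on to the next candidate
        if data is not None:
            return data
    raise LookupError(f"no __NEXT_DATA__ payload found for {url}")
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client and lets the fallback order be exercised with canned HTML in tests.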

🧪 Testing

To validate the implementation, I:

1. Added lyrics extraction to the extract_track_data function in json_parser.py
2. Extended test fixtures to validate lyrics extraction
3. Created standalone test scripts that verify the extraction works correctly

📋 Validation

The implementation was tested against real Spotify track data and successfully extracts:

Basic track information (ID, name, type)
Duration in milliseconds
Artist information
Album data with images
Preview URL for playback
Lyrics with line-synchronized timing data

📝 Example Output

{
  "id": "4u7EnebtmKWzUH433cf5Qv",
  "name": "Bohemian Rhapsody",
  "lyrics": {
    "sync_type": "LINE_SYNCED",
    "lines": [
      {
        "start_time_ms": 1000,
        "words": "Is this the real life?",
        "end_time_ms": 7000
      },
      // more lines...
    ],
    "provider": "SPOTIFY",
    "language": "en"
  }
}
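
As a rough illustration of how a consumer might use the line-synced structure shown above, the sketch below looks up the lyric line active at a given playback position. The helper names (`format_timestamp`, `line_at`) are hypothetical and not part of the PR, and treating `end_time_ms` as an exclusive bound is an assumption. The sample data repeats only the single line from the example output.

```python
from typing import Optional

# Sample data copied from the example output; the elided lines are omitted.
track = {
    "id": "4u7EnebtmKWzUH433cf5Qv",
    "name": "Bohemian Rhapsody",
    "lyrics": {
        "sync_type": "LINE_SYNCED",
        "lines": [
            {
                "start_time_ms": 1000,
                "words": "Is this the real life?",
                "end_time_ms": 7000,
            },
        ],
        "provider": "SPOTIFY",
        "language": "en",
    },
}


def format_timestamp(ms: int) -> str:
    """Format milliseconds as mm:ss.xx (hundredths of a second)."""
    minutes, rest = divmod(ms, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis // 10:02d}"


def line_at(lyrics: dict, position_ms: int) -> Optional[str]:
    """Return the words of the line active at position_ms, or None."""
    for line in lyrics.get("lines", []):
        # Assumption: a line is active from start (inclusive) to end (exclusive).
        if line["start_time_ms"] <= position_ms < line["end_time_ms"]:
            return line["words"]
    return None


for line in track["lyrics"]["lines"]:
    print(f"[{format_timestamp(line['start_time_ms'])}] {line['words']}")
```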

Fixes #20.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
