[PR #21] Implement TrackExtractor for Spotify Track Data Extraction with Lyrics Support #14

Closed

opened 2026-03-07 19:32:02 +03:00 by kerem · 0 comments

kerem commented

2026-03-07 19:32:02 +03:00

Owner

Original Pull Request: https://github.com/AliAkhtari78/SpotifyScraper/pull/21

State: closed
Merged: Yes

This PR implements the TrackExtractor class, which extracts track data from Spotify web pages. It also adds lyrics extraction with line-synchronized timing information.

🚀 Features Implemented

✅ Track metadata extraction (name, ID, URI, duration, artists, album)
✅ Preview URL and playability status extraction
✅ Lyrics extraction with synchronized timing
✅ Support for both regular and embed Spotify URLs
✅ Robust error handling with fallbacks

🧠 Implementation Details

The implementation uses a multi-strategy approach to extract track data:

1. First attempts to use embed URLs (more reliable, no authentication needed)
2. Falls back to regular URLs if embed extraction fails
3. Extracts data from the JSON in Spotify's __NEXT_DATA__ script tag
4. Handles the various field formats and structures that appear in that JSON
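
The strategy above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `to_embed_url`, `parse_next_data`, and `extract_with_fallback` are hypothetical names, the `fetch` callable stands in for whatever HTTP client the library actually uses, and the `/embed/track/<id>` URL layout is an assumption.

```python
import json
import re
from typing import Callable, Optional

# Captures the body of <script id="__NEXT_DATA__" ...>...</script>.
_NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)
_TRACK_ID_RE = re.compile(r"/track/([A-Za-z0-9]+)")


def to_embed_url(url: str) -> str:
    """Rewrite a track URL to its embed form (assumed layout, see lead-in)."""
    match = _TRACK_ID_RE.search(url)
    if match is None:
        raise ValueError(f"not a track URL: {url!r}")
    return f"https://open.spotify.com/embed/track/{match.group(1)}"


def parse_next_data(html: str) -> Optional[dict]:
    """Return the decoded __NEXT_DATA__ JSON, or None if absent or malformed."""
    match = _NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def extract_with_fallback(url: str, fetch: Callable[[str], str]) -> dict:
    """Try the embed page first, then fall back to the regular page."""
    for candidate in (to_embed_url(url), url):
        try:
            data = parse_next_data(fetch(candidate))
        except OSError:
            # A network failure on one URL should not block the fallback.
            data = None
        if data is not None:
            return data
    raise ValueError(f"no __NEXT_DATA__ payload found for {url!r}")
```

Passing `fetch` in as a parameter keeps the sketch free of network calls, so the fallback order can be exercised with canned HTML.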

🧪 Testing

To validate the implementation, I:

1. Added lyrics extraction to the extract_track_data function in json_parser.py
2. Extended test fixtures to validate lyrics extraction
3. Created standalone test scripts that verify the extraction works correctly
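
A fixture-based check of this kind might look like the sketch below. It does not reproduce the real `extract_track_data` or the project's fixtures: `parse_lyrics`, `_first`, and `RAW_FIXTURE` are hypothetical, and the camelCase keys and string timestamps in the fixture are assumptions about the raw payload chosen to illustrate handling of varying field formats.

```python
import unittest
from typing import Any, Optional


def _first(mapping: dict, *keys: str) -> Any:
    """Return the value of the first key present in mapping, else None."""
    for key in keys:
        if key in mapping:
            return mapping[key]
    return None


def parse_lyrics(raw: dict) -> Optional[dict]:
    """Normalize a raw lyrics payload into snake_case keys and int timestamps."""
    lyrics = raw.get("lyrics")
    if not isinstance(lyrics, dict):
        return None
    lines = []
    for line in lyrics.get("lines", []):
        start = _first(line, "start_time_ms", "startTimeMs")
        end = _first(line, "end_time_ms", "endTimeMs")
        lines.append({
            "start_time_ms": int(start) if start is not None else None,
            "words": line.get("words", ""),
            "end_time_ms": int(end) if end is not None else None,
        })
    return {
        "sync_type": _first(lyrics, "sync_type", "syncType"),
        "lines": lines,
        "provider": lyrics.get("provider"),
        "language": lyrics.get("language"),
    }


# Hypothetical fixture: key names and string timestamps are assumptions.
RAW_FIXTURE = {
    "lyrics": {
        "syncType": "LINE_SYNCED",
        "lines": [
            {"startTimeMs": "1000", "words": "Is this the real life?", "endTimeMs": "7000"},
        ],
        "provider": "SPOTIFY",
        "language": "en",
    }
}


class ParseLyricsTest(unittest.TestCase):
    def test_normalizes_fixture(self):
        lyrics = parse_lyrics(RAW_FIXTURE)
        self.assertEqual(lyrics["sync_type"], "LINE_SYNCED")
        self.assertEqual(lyrics["lines"][0]["start_time_ms"], 1000)
        self.assertEqual(lyrics["lines"][0]["end_time_ms"], 7000)

    def test_missing_lyrics_returns_none(self):
        self.assertIsNone(parse_lyrics({"id": "4u7EnebtmKWzUH433cf5Qv"}))
```

Saved as a module, the tests can be run with `python -m unittest <module>`.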

📋 Validation

The implementation was tested against real Spotify track data and successfully extracts:

Basic track information (ID, name, type)
Duration in milliseconds
Artist information
Album data with images
Preview URL for playback
Lyrics with line-synchronized timing data

📝 Example Output

{
  "id": "4u7EnebtmKWzUH433cf5Qv",
  "name": "Bohemian Rhapsody",
  "lyrics": {
    "sync_type": "LINE_SYNCED",
    "lines": [
      {
        "start_time_ms": 1000,
        "words": "Is this the real life?",
        "end_time_ms": 7000
      },
      // more lines...
    ],
    "provider": "SPOTIFY",
    "language": "en"
  }
}
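
As a usage illustration, a consumer of this output could look up the line being sung at a given playback position. The helper below is a sketch, not part of the PR: `line_at` is a hypothetical name, and it assumes `lines` is sorted by `start_time_ms` and that every line carries an integer `start_time_ms`.

```python
from bisect import bisect_right
from typing import Optional


def line_at(lyrics: dict, position_ms: int) -> Optional[str]:
    """Return the words of the line active at position_ms, or None."""
    lines = lyrics["lines"]
    # Assumes lines are sorted by start time, as in the example output.
    starts = [line["start_time_ms"] for line in lines]
    index = bisect_right(starts, position_ms) - 1
    if index < 0:
        return None  # Playback is before the first line starts.
    line = lines[index]
    end = line.get("end_time_ms")
    if end is not None and position_ms >= end:
        return None  # The line has ended and the next has not started.
    return line["words"]


lyrics = {
    "sync_type": "LINE_SYNCED",
    "lines": [
        {"start_time_ms": 1000, "words": "Is this the real life?", "end_time_ms": 7000},
    ],
}
print(line_at(lyrics, 3000))  # prints "Is this the real life?"
```

Binary search keeps each lookup logarithmic in the number of lines, which matters when the lookup is driven by a playback timer.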

Fixes #20.

