[PR #21] Implement TrackExtractor for Spotify Track Data Extraction with Lyrics Support #14

Closed
opened 2026-03-07 19:32:02 +03:00 by kerem · 0 comments
Owner

Original Pull Request: https://github.com/AliAkhtari78/SpotifyScraper/pull/21

State: closed
Merged: Yes


This PR implements the TrackExtractor class, which extracts comprehensive track data from Spotify web pages. It also adds lyrics extraction with synchronized timing information.

🚀 Features Implemented

  • Track metadata extraction (name, ID, URI, duration, artists, album)
  • Preview URL and playability status extraction
  • Lyrics extraction with synchronized timing
  • Support for both regular and embed Spotify URLs
  • Robust error handling with fallbacks
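
Supporting both regular and embed URLs implies some normalization step between the two forms. The PR does not show how it does this, so the following is only a minimal sketch: the helper names (`extract_track_id`, `to_embed_url`, `to_regular_url`) are hypothetical, and the pattern assumes plain `open.spotify.com/track/<id>` and `open.spotify.com/embed/track/<id>` URLs with a 22-character base-62 track ID.

```python
import re

# Hypothetical helpers for illustration; not taken from the PR.
# Assumes a 22-character base-62 track ID directly after "/track/".
_TRACK_ID_RE = re.compile(
    r"open\.spotify\.com/(?:embed/)?track/([A-Za-z0-9]{22})"
)


def extract_track_id(url: str) -> str:
    """Return the track ID from a regular or embed track URL."""
    match = _TRACK_ID_RE.search(url)
    if match is None:
        raise ValueError(f"Not a Spotify track URL: {url!r}")
    return match.group(1)


def to_embed_url(url: str) -> str:
    """Build the embed form of a track URL, dropping any query string."""
    return f"https://open.spotify.com/embed/track/{extract_track_id(url)}"


def to_regular_url(url: str) -> str:
    """Build the regular form of a track URL, dropping any query string."""
    return f"https://open.spotify.com/track/{extract_track_id(url)}"
```

With helpers like these, a caller can accept either URL form and still try the embed page first.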

🧠 Implementation Details

The implementation uses a multi-strategy approach to extract track data:

  1. First attempts to use embed URLs (more reliable, no authentication needed)
  2. Falls back to regular URLs if embed extraction fails
  3. Extracts data from Spotify's __NEXT_DATA__ script tag containing JSON data
  4. Handles various field formats and structures in the Spotify API response
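
The embed-first strategy described above could look roughly like the sketch below. It is not the PR's code: `parse_next_data` and `fetch_track_json` are hypothetical names, the page fetcher is passed in as a plain callable so that the example does not depend on any particular HTTP client, and the regular expression assumes the script tag carries `id="__NEXT_DATA__"` with double quotes.

```python
import json
import re
from typing import Callable, Optional

# Assumption: the page embeds its JSON in <script id="__NEXT_DATA__" ...>.
_NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def parse_next_data(html: str) -> Optional[dict]:
    """Return the decoded __NEXT_DATA__ JSON object, or None if unavailable."""
    match = _NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def fetch_track_json(track_id: str, fetch: Callable[[str], str]) -> dict:
    """Try the embed page first, then fall back to the regular page.

    `fetch` is a caller-supplied function mapping a URL to HTML text
    (hypothetical; the PR's own HTTP layer is not shown).
    """
    candidates = [
        f"https://open.spotify.com/embed/track/{track_id}",
        f"https://open.spotify.com/track/{track_id}",
    ]
    for url in candidates:
        try:
            html = fetch(url)
        except Exception:
            # Any fetch failure moves on to the next candidate URL.
            continue
        data = parse_next_data(html)
        if data is not None:
            return data
    raise LookupError(f"No __NEXT_DATA__ JSON found for track {track_id}")
```

Injecting `fetch` keeps the fallback logic testable against saved HTML fixtures without network access.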

🧪 Testing

To validate the implementation, I:

  1. Added lyrics extraction to the extract_track_data function in json_parser.py
  2. Extended test fixtures to validate lyrics extraction
  3. Created standalone test scripts that verify the extraction works correctly

📋 Validation

The implementation was tested against real Spotify track data and successfully extracts:

  • Basic track information (ID, name, type)
  • Duration in milliseconds
  • Artist information
  • Album data with images
  • Preview URL for playback
  • Lyrics with line-synchronized timing data

📝 Example Output

{
  "id": "4u7EnebtmKWzUH433cf5Qv",
  "name": "Bohemian Rhapsody",
  "lyrics": {
    "sync_type": "LINE_SYNCED",
    "lines": [
      {
        "start_time_ms": 1000,
        "words": "Is this the real life?",
        "end_time_ms": 7000
      },
      // more lines...
    ],
    "provider": "SPOTIFY",
    "language": "en"
  }
}
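
As an illustration of how the line-synchronized structure above might be consumed, the sketch below looks up the lyric line active at a given playback position. The function name `line_at` is hypothetical, and the code assumes that `lines` is sorted by `start_time_ms` and that `end_time_ms` is exclusive; neither assumption is confirmed by the PR.

```python
from bisect import bisect_right
from typing import Optional


def line_at(lyrics: dict, position_ms: int) -> Optional[str]:
    """Return the words of the line active at position_ms, or None.

    Assumes lyrics["lines"] is sorted by start_time_ms (not confirmed
    by the PR) and treats end_time_ms as an exclusive bound.
    """
    lines = lyrics.get("lines", [])
    starts = [line["start_time_ms"] for line in lines]
    # Index of the last line whose start time is <= position_ms.
    index = bisect_right(starts, position_ms) - 1
    if index < 0:
        return None
    line = lines[index]
    end = line.get("end_time_ms")
    if end is not None and position_ms >= end:
        return None
    return line["words"]


lyrics = {
    "sync_type": "LINE_SYNCED",
    "lines": [
        {"start_time_ms": 1000, "words": "Is this the real life?", "end_time_ms": 7000},
    ],
}
print(line_at(lyrics, 3000))
```

A binary search keeps the lookup cheap when it is called repeatedly during playback.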

Fixes #20.



Reference
starred/SpotifyScraper#14