starred/SpotifyScraper

mirror of https://github.com/AliAkhtari78/SpotifyScraper.git synced 2026-04-25 19:45:49 +03:00

[GH-ISSUE #20] Implement TrackExtractor for Spotify Track Data Extraction #66

Closed

opened 2026-03-13 22:58:23 +03:00 by kerem · 1 comment

kerem commented

2026-03-13 22:58:23 +03:00

Owner

Originally created by @AliAkhtari78 on GitHub (May 21, 2025).
Original GitHub issue: https://github.com/AliAkhtari78/SpotifyScraper/issues/20

Originally assigned to: @Copilot on GitHub.

📋 Overview

You are implementing the heart of the SpotifyScraper library - the TrackExtractor class. This component will extract comprehensive track data from Spotify web pages, including the exciting new feature: lyrics with timing information.

🎯 Success Criteria

✅ Extract all track metadata (name, ID, URI, duration, artists, album)
✅ Extract preview URLs and playability status
✅ Extract lyrics with synchronized timing when available
✅ Handle both regular and embed Spotify URLs seamlessly
✅ Return data matching tests/fixtures/json/track_expected.json
✅ Pass all tests in tests/unit/test_track_extractor.py

🛠️ Implementation Requirements

Core File to Create

📁 src/spotify_scraper/extractors/track.py

Required Class Structure

class TrackExtractor:
    def __init__(self, browser: Browser):
        """Initialize with browser interface"""
        
    def extract(self, url: str) -> TrackData:
        """Main extraction method"""
        
    def _validate_url(self, url: str) -> str:
        """Validate and convert to embed URL"""
        
    def _extract_from_html(self, html: str) -> TrackData:
        """Extract data from HTML content"""

🔍 Technical Deep Dive

Understanding Spotify's Architecture

Spotify's web player is a React-based interface that embeds page data as JSON in a __NEXT_DATA__ script tag. The two URL types differ in what they expose:

URL Type	Access Level	Data Available	Login Required
`open.spotify.com/track/ID`	Full	Complete + Lyrics	✅ Yes
`open.spotify.com/embed/track/ID`	Limited	Basic Metadata	❌ No
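
To make the `__NEXT_DATA__` idea concrete, here is a minimal sketch that pulls the JSON payload out of an HTML string using only the standard library. The names `NextDataParser` and `load_next_data` are hypothetical helpers, not part of SpotifyScraper, and the sketch assumes the page really carries a `<script id="__NEXT_DATA__">` tag as described above.

```python
import json
from html.parser import HTMLParser
from typing import Any, Optional


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__"> (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.chunks = []  # text fragments found inside the target script tag

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def load_next_data(html: str) -> Optional[Any]:
    """Return the parsed JSON payload, or None if the tag is absent or empty."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    return json.loads(raw) if raw else None
```

Returning `None` instead of raising lets the caller decide whether a missing tag is an error or a signal that the page structure has changed.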

Data Extraction Strategy

graph TD
    A[Input URL] --> B{Is Embed URL?}
    B -->|No| C[Convert to Embed]
    B -->|Yes| D[Use Directly]
    C --> D
    D --> E[Fetch HTML]
    E --> F[Parse __NEXT_DATA__]
    F --> G[Extract Track Data]
    G --> H[Return TrackData]
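
The first two steps of the flow above (detect an embed URL, otherwise convert) could be sketched as follows. `to_embed_url` is a hypothetical name, not the library's `convert_to_embed_url`, and the sketch assumes track IDs are 22 alphanumeric characters, as in the example URLs in this issue. It does not handle locale-prefixed paths or `spotify:` URIs.

```python
import re
from urllib.parse import urlparse

# Assumption: a track path is /track/<id> or /embed/track/<id> with a 22-character ID.
_TRACK_PATH = re.compile(r"^/(?:embed/)?track/([A-Za-z0-9]{22})/?$")


def to_embed_url(url: str) -> str:
    """Validate a Spotify track URL and return its embed form."""
    parsed = urlparse(url)
    if parsed.netloc != "open.spotify.com":
        raise ValueError(f"Not an open.spotify.com URL: {url!r}")
    match = _TRACK_PATH.match(parsed.path)
    if match is None:
        raise ValueError(f"Not a track URL: {url!r}")
    # Query strings such as ?si=... are dropped on purpose.
    return f"https://open.spotify.com/embed/track/{match.group(1)}"
```

Because embed URLs pass through unchanged, the same function covers both branches of the diagram.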

🧪 Testing & Validation

Test Data Location

Input HTML: tests/fixtures/html/track_modern.html
Expected Output: tests/fixtures/json/track_expected.json
Test File: tests/unit/test_track_extractor.py

Running Tests

# Run specific test
pytest tests/unit/test_track_extractor.py -v

# Run with coverage
pytest tests/unit/test_track_extractor.py --cov=src/spotify_scraper/extractors

# Run in watch mode
pytest-watch tests/unit/test_track_extractor.py

🚀 Implementation Approaches

Approach 1: Use Existing Browser Interface ⭐ Recommended

from spotify_scraper.browsers.base import Browser
from spotify_scraper.core.types import TrackData
from spotify_scraper.parsers.json_parser import extract_track_data_from_page

class TrackExtractor:
    def __init__(self, browser: Browser):
        # Browser implementation used to fetch page HTML
        self.browser = browser

    def extract(self, url: str) -> TrackData:
        # Fetch the page, then delegate parsing to the shared JSON parser
        html = self.browser.get_page_content(url)
        return extract_track_data_from_page(html)

Approach 2: Direct HTTP Request 🔧 If browser fails

import requests
from spotify_scraper.core.types import TrackData
from spotify_scraper.utils.url import convert_to_embed_url

# Method of TrackExtractor
def extract(self, url: str) -> TrackData:
    embed_url = convert_to_embed_url(url)
    response = requests.get(
        embed_url,
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'},
        timeout=10,
    )
    response.raise_for_status()  # fail early on HTTP errors
    return self._extract_from_html(response.text)

Approach 3: Live Testing 🌐 For verification

Test with real URLs (use embed versions to avoid login):

https://open.spotify.com/embed/track/4u7EnebtmKWzUH433cf5Qv (Bohemian Rhapsody)
https://open.spotify.com/embed/track/7qiZfU4dY1lWllzX7mPBI3 (Shape of You)

🆘 Troubleshooting Guide

Issue: Cannot Access Dependencies

Solution: Import what you need or create minimal implementations:

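import json

import requests
from bs4 import BeautifulSoup

# If browser interface not ready
class MockBrowser:
    def get_page_content(self, url):
        return requests.get(url, timeout=10).text

# If JSON parser not ready
def extract_from_next_data(html):
    soup = BeautifulSoup(html, 'html.parser')
    script = soup.find('script', {'id': '__NEXT_DATA__'})
    if script is None or not script.string:
        raise ValueError('__NEXT_DATA__ script tag not found')
    return json.loads(script.string)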

Issue: Tests Failing

Check fixture data: Compare your output with expected JSON
Debug step by step: Print intermediate results
Validate HTML structure: Ensure you're parsing the right elements
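
For the first two checks above, a small diff helper can make mismatches against the expected JSON easier to read than a single failed equality assertion. This is a sketch only. `diff_paths` is a hypothetical debugging helper, not something the test suite provides.

```python
from typing import Any, List


def diff_paths(actual: Any, expected: Any, path: str = "$") -> List[str]:
    """Return human-readable differences between two JSON-like values."""
    if isinstance(actual, dict) and isinstance(expected, dict):
        problems = []
        for key in sorted(set(actual) | set(expected)):
            child = f"{path}.{key}"
            if key not in actual:
                problems.append(f"{child}: missing from output")
            elif key not in expected:
                problems.append(f"{child}: not in expected")
            else:
                problems.extend(diff_paths(actual[key], expected[key], child))
        return problems
    if isinstance(actual, list) and isinstance(expected, list):
        if len(actual) != len(expected):
            return [f"{path}: length {len(actual)} != {len(expected)}"]
        problems = []
        for index, (a, e) in enumerate(zip(actual, expected)):
            problems.extend(diff_paths(a, e, f"{path}[{index}]"))
        return problems
    # Leaf values (or mismatched container types) are compared directly.
    return [] if actual == expected else [f"{path}: {actual!r} != {expected!r}"]
```

Calling it with the extractor output and the loaded contents of `tests/fixtures/json/track_expected.json` would list each differing path on its own line.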

Issue: URL Access Problems

✅ Always convert to embed URLs first
✅ Use proper User-Agent headers
✅ Handle rate limiting with delays
✅ Implement retry logic
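
The last two points (delays and retries) can be combined in one wrapper. The following is a minimal sketch, assuming exponential backoff is acceptable. `fetch_with_retry` and its parameters are hypothetical names, and the fetch function is injected so the wrapper works with either a browser object or a plain HTTP call.

```python
import time
from typing import Callable, Tuple, Type


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> str:
    """Call fetch(url), retrying with exponential backoff on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch(url)
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Usage might look like `fetch_with_retry(browser.get_page_content, embed_url)`. Passing `sleep` explicitly keeps unit tests fast because no real waiting is needed.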

Issue: Missing Lyrics

Lyrics are only available on full URLs (not embeds). For now:

Extract basic data from embed URLs
Return None for lyrics field
Document this limitation
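
One way to apply the interim rule above is to make sure every result built from an embed page carries an explicit `lyrics` key set to `None`, so callers can rely on the field being present. This is a sketch under that assumption. `finalize_embed_track` is a hypothetical helper, and the real `TrackData` shape is defined in `core/types.py`.

```python
from typing import Any, Dict


def finalize_embed_track(track: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `track` that always carries a `lyrics` key.

    Embed pages are treated as having no lyrics, so the value is forced to None.
    """
    result = dict(track)      # shallow copy; the caller's dict is left untouched
    result["lyrics"] = None   # known limitation of embed URLs
    return result
```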

🤝 Interactive Support

When You're Stuck

🔍 Research: Test with live Spotify URLs to understand the data structure
💬 Ask Questions:
- Comment on this issue with specific problems
- Tag me (@AliAkhtari78) for urgent help
- Open a new issue for complex blockers
🛠️ Get Creative:
- Create utility functions as needed
- Write helper methods for complex parsing
- Build debugging tools to inspect data

Communication Examples

@AliAkhtari78 The __NEXT_DATA__ structure is different than expected. 
Current structure: {...}
Expected structure: {...}
Should I adapt the parser or update the test fixture?

📚 Resources & References

Code References

core/types.py - TrackData type definition
core/exceptions.py - Error handling
parsers/json_parser.py - JSON parsing utilities

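External Resources

BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Requests Documentation: https://requests.readthedocs.io/
Python Type Hints Guide: https://docs.python.org/3/library/typing.html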

Spotify-Specific

Regular URL: https://open.spotify.com/track/4u7EnebtmKWzUH433cf5Qv
Embed URL: https://open.spotify.com/embed/track/4u7EnebtmKWzUH433cf5Qv

🔍 Live Testing URLs

Test with current Spotify data:

# Quick test to see current structure
import requests
response = requests.get('https://open.spotify.com/embed/track/4u7EnebtmKWzUH433cf5Qv')
print(response.text[:10000])  # First 10000 chars

⚡ Quick Start Checklist

📁 Create src/spotify_scraper/extractors/track.py
🏗️ Implement TrackExtractor class
🔧 Add URL validation and conversion
📊 Implement HTML parsing logic
🧪 Run tests to verify implementation
📝 Update tests if needed (with approval)
🎉 Celebrate your contribution!

💡 Pro Tips

🎯 Focus on embed URLs first - They don't require authentication and provide most of the needed data
🧪 Use the test fixtures - They contain real Spotify data structures
🔄 Iterate quickly - Build basic extraction first, then add features
📞 Communicate early - Ask questions before you're completely stuck

kerem

2026-03-13 22:58:23 +03:00

closed this issue
added the enhancement, bug, and refactoring labels

kerem commented

2026-03-13 22:58:39 +03:00

Author

Owner

@AliAkhtari78 commented on GitHub (May 22, 2025):

@Copilot, please summarize and report what you accomplished in this stage. The report should be detailed and well-formatted like the originally generated issue.

kerem referenced this issue

2026-03-13 23:00:45 +03:00

[GH-ISSUE #66] Identify any missing pages at https://spotifyscraper.readthedocs.io #83

kerem referenced this issue

2026-03-13 23:03:53 +03:00

[PR #67] [MERGED] Audit and identify 12 missing documentation pages causing broken links #130