[PR #29] [Failed] Build the data extraction pipeline #18

Closed
opened 2026-03-07 19:32:03 +03:00 by kerem · 0 comments
Owner

Original Pull Request: https://github.com/AliAkhtari78/SpotifyScraper/pull/29

State: closed
Merged: No


Thanks for assigning this issue to me. I'm starting to work on it and will keep this PR's description up to date as I form a plan and make progress.

Original issue description:

> Mission Critical: Build the data extraction pipeline that transforms Spotify's HTML into structured track data with lyrics support


📊 Project Context & Importance

What This Task Accomplishes

This task completes the core data extraction pipeline for SpotifyScraper 2.0. You're implementing the critical function that:

  • Transforms raw Spotify HTML into structured, usable data
  • Enables the exciting new lyrics extraction feature
  • Powers all future track-based functionality in the library
  • Validates that our modern architecture actually works end-to-end

Why This Task Matters

Without robust JSON parsing, the entire SpotifyScraper library is just empty scaffolding. This implementation:

  • Unlocks user value: Enables actual music data extraction
  • Validates architecture: Proves our new modular design works
  • Enables expansion: Creates the pattern for album/artist/playlist extractors
  • Delivers innovation: Adds lyrics with timing that wasn't in v1.0

🎯 Mission Objectives

Primary Goals

  • Parse Track Metadata: Extract name, ID, URI, duration, artists, album info
  • Extract Media URLs: Get preview audio and cover art links
  • Parse Lyrics Data: Extract synchronized lyrics with timing information
  • Handle Edge Cases: Gracefully handle missing or malformed data
  • Pass All Tests: Meet 100% success criteria for validation

Success Metrics

| Metric | Target | How to Measure |
|--------|--------|----------------|
| Fixture Test | 100% match | Output matches `track_expected.json` exactly |
| Live URL Test | 3/3 working | All test URLs extract successfully |
| Unit Tests | All passing | `pytest test_track_extractor.py` green |
| Error Handling | Graceful degradation | Returns partial data instead of crashing |

🔍 Phase 1: Research and Discovery

Step 1.1: Understand Current Spotify Architecture 🌐

Use your web search capabilities to research the current Spotify web structure:

Research these topics systematically:

  1. "Spotify web player NEXT_DATA structure 2025"
  2. "Spotify embed page JSON data format"
  3. "Spotify track page HTML structure changes"

Interactive Checkpoint: After researching, analyze what you find and ask me (@AliAkhtari78):

  • Are there any major changes in Spotify's structure since our fixtures were created?
  • Should we update our test fixtures based on current live data?
  • Are there new data fields we should consider extracting?

Step 1.2: Analyze Existing Test Fixtures 📊

Examine the provided test data systematically:

  1. Load and inspect the fixture HTML:
with open('tests/fixtures/html/track_modern.html', 'r', encoding='utf-8') as f:
    html_content = f.read()

# Find the __NEXT_DATA__ script tag
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
script_tag = soup.find('script', {'id': '__NEXT_DATA__'})
print("JSON structure preview:")
print(script_tag.string[:500] + "...")

  2. Compare with expected output:
import json
with open('tests/fixtures/json/track_expected.json', 'r') as f:
    expected = json.load(f)

print("Expected output structure:")
for key in expected.keys():
    print(f"  {key}: {type(expected[key])}")

Interactive Checkpoint: After analysis, tag me (@AliAkhtari78) with:

  • Any discrepancies you find between fixture and expected output
  • Questions about data field priorities or edge cases
  • Suggestions for additional test cases

Step 1.3: Test Live Spotify URLs 🎵

Use your browser tools to fetch current live data:

import requests

test_urls = [
    "https://open.spotify.com/embed/track/4u7EnebtmKWzUH433cf5Qv",  # Bohemian Rhapsody
    "https://open.spotify.com/embed/track/7qiZfU4dY1lWllzX7mPBI3",  # Shape of You
    "https://open.spotify.com/embed/track/1Ax3zx5TJBRi4Ol8hPU9N8",  # Anti-Hero
]

for url in test_urls:
    try:
        response = requests.get(url, headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
        print(f"URL: {url}")
        print(f"Status: {response.status_code}")

        # Look for __NEXT_DATA__
        if '__NEXT_DATA__' in response.text:
            print("✅ Contains __NEXT_DATA__")
        else:
            print("❌ No __NEXT_DATA__ found")

    except Exception as e:
        print(f"❌ Error accessing {url}: {e}")

Interactive Checkpoint: Ask me (@AliAkhtari78):

  • If any test URLs fail to load or have structural differences
  • Whether you should create new test fixtures from current live data
  • If you need different test URLs for better coverage

🛠️ Phase 2: Implementation Strategy

Step 2.1: Design the Parsing Pipeline 🏗️

Map out your implementation approach:

def extract_track_data_from_page(html_content: str) -> TrackData:
    """
    Implementation roadmap:

    1. Extract __NEXT_DATA__ JSON from HTML
    2. Navigate to track entity in JSON structure
    3. Parse basic track metadata (name, ID, URI, etc.)
    4. Extract artist information
    5. Parse album data and images
    6. Extract audio preview URLs
    7. Parse lyrics with timing (if available)
    8. Handle missing data gracefully
    9. Return structured TrackData object
    """
    pass

Create a development checklist:

  • [ ] JSON extraction from HTML works
  • [ ] Basic track fields parsing
  • [ ] Artist data extraction
  • [ ] Album data with images
  • [ ] Preview URL extraction
  • [ ] Lyrics parsing with timing
  • [ ] Error handling for missing data
  • [ ] Type compliance with TrackData
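The checklist's last item asks for type compliance with `TrackData`, which is defined elsewhere in the codebase. For orientation only, here is a minimal sketch of what such a type might look like — the field names are assumptions inferred from the extractors in this plan, not the library's actual definition:

```python
from typing import List, Optional, TypedDict

class ImageData(TypedDict):
    url: str
    width: int
    height: int

# Hypothetical shape; the real TrackData lives in spotify_scraper's type definitions.
class TrackData(TypedDict, total=False):
    id: str
    name: str
    uri: str
    type: str
    duration_ms: int
    is_playable: bool
    is_explicit: bool
    artists: List[dict]
    album: dict
    preview_url: str
    lyrics: Optional[dict]
    ERROR: str  # set instead of raising, to allow graceful degradation
```

With `total=False`, every field is optional, which matches the "return partial data instead of crashing" success metric.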

Step 2.2: Implement Core JSON Extraction 📄

Start with the foundation - getting JSON from HTML:

import json
from bs4 import BeautifulSoup
from spotify_scraper.core.exceptions import ParsingError

def extract_next_data_json(html_content: str) -> dict:
    """Extract and parse __NEXT_DATA__ JSON from Spotify page."""
    try:
        soup = BeautifulSoup(html_content, 'html.parser')
        script_tag = soup.find('script', {'id': '__NEXT_DATA__'})

        if not script_tag or not script_tag.string:
            raise ParsingError("No __NEXT_DATA__ script tag found")

        return json.loads(script_tag.string)

    except json.JSONDecodeError as e:
        raise ParsingError(f"Invalid JSON in __NEXT_DATA__: {e}")
    except Exception as e:
        raise ParsingError(f"Failed to extract JSON: {e}")

Validation checkpoint:

# Test with your fixture
with open('tests/fixtures/html/track_modern.html', 'r') as f:
    html = f.read()

json_data = extract_next_data_json(html)
print(f"Successfully extracted JSON with {len(json_data)} top-level keys")
print(f"Keys: {list(json_data.keys())}")

Step 2.3: Navigate JSON Structure 🗺️

Implement the path navigation to track data:

def get_track_entity(json_data: dict) -> dict:
    """Navigate to track entity in Spotify JSON structure."""
    try:
        # Follow the path: props.pageProps.state.data.entity
        entity = (json_data
                  .get('props', {})
                  .get('pageProps', {})
                  .get('state', {})
                  .get('data', {})
                  .get('entity', {}))

        if not entity or entity.get('type') != 'track':
            raise ParsingError("Track entity not found or invalid type")

        return entity

    except Exception as e:
        raise ParsingError(f"Failed to navigate to track entity: {e}")

Validation checkpoint:

json_data = extract_next_data_json(html_content)
track_entity = get_track_entity(json_data)
print(f"Track entity keys: {list(track_entity.keys())}")
print(f"Track name: {track_entity.get('name', 'NOT FOUND')}")

Step 2.4: Implement Systematic Data Extraction 📊

Build extractors for each data category:

def extract_basic_track_info(entity: dict) -> dict:
    """Extract core track information."""
    return {
        'id': entity.get('id', ''),
        'name': entity.get('name', ''),
        'uri': entity.get('uri', ''),
        'type': 'track',
        # Module-level helpers to implement (these are plain functions, not methods,
        # so no `self` here)
        'duration_ms': _safe_extract_duration(entity),
        'is_playable': entity.get('playability', {}).get('playable', False),
        'is_explicit': _extract_explicit_flag(entity),
    }

def extract_artists_data(entity: dict) -> list:
    """Extract artist information."""
    artists = []
    artists_data = entity.get('artists', {}).get('items', [])

    for artist in artists_data:
        profile = artist.get('profile', {})
        artists.append({
            'name': profile.get('name', ''),
            'uri': artist.get('uri', ''),
            'id': artist.get('uri', '').split(':')[-1] if artist.get('uri') else '',
        })

    return artists

def extract_album_data(entity: dict) -> dict:
    """Extract album information including images."""
    album_data = entity.get('albumOfTrack', {})
    if not album_data:
        return {}

    # Extract cover art images
    images = []
    cover_art = album_data.get('coverArt', {}).get('sources', [])
    for img in cover_art:
        images.append({
            'url': img.get('url', ''),
            'width': img.get('width', 0),
            'height': img.get('height', 0),
        })

    return {
        'name': album_data.get('name', ''),
        'uri': album_data.get('uri', ''),
        'id': album_data.get('uri', '').split(':')[-1] if album_data.get('uri') else '',
        'images': images,
        'release_date': _extract_release_date(album_data),
    }

Interactive Checkpoint: After implementing basic extraction, tag me (@AliAkhtari78):

  • Show me sample output from each extractor function
  • Ask about any unexpected data structures you encounter
  • Request guidance on handling edge cases or missing fields

Step 2.5: Implement Advanced Features ⭐

Focus on the exciting new features:

from typing import Optional

def extract_lyrics_data(entity: dict) -> Optional[dict]:
    """Extract synchronized lyrics with timing information."""
    lyrics_data = entity.get('lyrics', {})
    if not lyrics_data:
        return None

    # Parse synchronized lyrics lines
    lines = []
    for line in lyrics_data.get('lines', []):
        lines.append({
            'start_time_ms': line.get('startTimeMs', 0),
            'words': line.get('words', ''),
            'end_time_ms': line.get('endTimeMs', 0),
        })

    return {
        'sync_type': lyrics_data.get('syncType', ''),
        'lines': lines,
        'provider': lyrics_data.get('provider', ''),
        'language': lyrics_data.get('language', ''),
    }

def extract_preview_url(entity: dict) -> str:
    """Extract audio preview URL."""
    audio_preview = entity.get('audioPreview', {})
    return audio_preview.get('url', '') if audio_preview else ''
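The individual extractors still need to be composed into `extract_track_data_from_page`. Below is a minimal, stdlib-only sketch of that composition which returns partial data with an `'ERROR'` key instead of raising — the graceful-degradation behavior the Phase 3 error-handling tests assume. The real implementation would call the BeautifulSoup-based helpers defined earlier; the regex here is just a dependency-free stand-in:

```python
import json
import re
from typing import Any, Dict

def extract_track_data_from_page(html_content: str) -> Dict[str, Any]:
    """Compose the extraction steps; on any failure, return partial
    data with an 'ERROR' key rather than crashing (sketch)."""
    result: Dict[str, Any] = {'type': 'track'}
    try:
        # Stand-in for extract_next_data_json(): locate the JSON payload
        match = re.search(
            r'<script id=[\'"]__NEXT_DATA__[\'"][^>]*>(.*?)</script>',
            html_content, re.DOTALL)
        if not match:
            raise ValueError("No __NEXT_DATA__ script tag found")
        data = json.loads(match.group(1))

        # Stand-in for get_track_entity(): walk to the track entity
        entity = (data.get('props', {}).get('pageProps', {})
                      .get('state', {}).get('data', {}).get('entity', {}))
        if not entity:
            raise ValueError("Track entity not found")

        # Basic fields; the full version would also merge artists,
        # album, preview URL, and lyrics from the extractors above
        result['id'] = entity.get('id', '')
        result['name'] = entity.get('name', '')
        result['uri'] = entity.get('uri', '')
    except Exception as e:
        result['ERROR'] = str(e)
    return result

# Malformed input degrades instead of crashing:
assert 'ERROR' in extract_track_data_from_page('<html></html>')
```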


🧪 Phase 3: Testing and Validation

Step 3.1: Unit Test Development

Create comprehensive test cases:

def test_extract_track_data_from_page():
    """Test the main extraction function."""
    # Load test fixture
    with open('tests/fixtures/html/track_modern.html', 'r') as f:
        html = f.read()

    # Extract data
    result = extract_track_data_from_page(html)

    # Load expected results
    with open('tests/fixtures/json/track_expected.json', 'r') as f:
        expected = json.load(f)

    # Validate each field systematically
    assert result['name'] == expected['name'], f"Name mismatch: {result['name']} != {expected['name']}"
    assert result['id'] == expected['id'], f"ID mismatch: {result['id']} != {expected['id']}"

    # Test lyrics specifically (new feature)
    if 'lyrics' in expected:
        assert 'lyrics' in result, "Lyrics data missing from result"
        assert len(result['lyrics']['lines']) > 0, "No lyrics lines extracted"

    print("✅ All fixture tests passed!")

Step 3.2: Live URL Validation 🌐

Test with current Spotify data:

def test_live_spotify_urls():
    """Test extraction with live Spotify URLs."""
    from spotify_scraper.browsers.requests_browser import RequestsBrowser
    from spotify_scraper.auth.session import Session

    # Create browser for testing
    session = Session()
    browser = RequestsBrowser(session=session)

    test_urls = [
        "https://open.spotify.com/embed/track/4u7EnebtmKWzUH433cf5Qv",
        "https://open.spotify.com/embed/track/7qiZfU4dY1lWllzX7mPBI3",
    ]

    for url in test_urls:
        try:
            # Get live page content
            html = browser.get_page_content(url)

            # Extract track data
            result = extract_track_data_from_page(html)

            # Validate result
            assert result.get('name'), f"No track name extracted from {url}"
            assert result.get('id'), f"No track ID extracted from {url}"

            print(f"✅ Successfully extracted: {result['name']} from {url}")

        except Exception as e:
            print(f"❌ Failed to extract from {url}: {e}")
            # Don't fail the test, just report

Interactive Checkpoint: After testing, tag me (@AliAkhtari78):

  • Report results from both fixture and live URL tests
  • Share any discrepancies between expected and actual data
  • Ask for guidance on failing test cases

Step 3.3: Error Handling Validation 🛡️

Test robustness with edge cases:

def test_error_handling():
    """Test graceful handling of problematic inputs."""
    # Test cases for robust error handling
    test_cases = [
        ("", "Empty HTML"),
        ("<html></html>", "HTML without __NEXT_DATA__"),
        ("<script id='__NEXT_DATA__'>invalid json</script>", "Invalid JSON"),
        ("<script id='__NEXT_DATA__'>{}</script>", "Empty JSON"),
    ]

    for html_content, description in test_cases:
        try:
            result = extract_track_data_from_page(html_content)
            # Should return error data, not crash
            assert 'ERROR' in result, f"Should return error for: {description}"
            print(f"✅ Gracefully handled: {description}")

        except Exception as e:
            print(f"❌ Crashed on {description}: {e}")


🔧 Phase 4: Integration and Optimization

Step 4.1: Performance Testing

Measure and optimize extraction speed:

import time

def benchmark_extraction():
    """Benchmark extraction performance."""
    with open('tests/fixtures/html/track_modern.html', 'r') as f:
        html = f.read()

    # Warm up
    extract_track_data_from_page(html)

    # Benchmark multiple runs
    start_time = time.time()
    for _ in range(100):
        result = extract_track_data_from_page(html)
    end_time = time.time()

    avg_time = (end_time - start_time) / 100
    print(f"Average extraction time: {avg_time:.4f} seconds")

    # Target: < 0.1 seconds per extraction
    if avg_time > 0.1:
        print("⚠️ Consider optimization - extraction is slow")
    else:
        print("✅ Performance is acceptable")

Step 4.2: Memory Usage Testing 💾

Ensure efficient memory usage:

import psutil
import os

def test_memory_usage():
    """Test memory efficiency of extraction."""
    process = psutil.Process(os.getpid())

    # Baseline memory
    baseline = process.memory_info().rss / 1024 / 1024  # MB

    # Load the fixture HTML once
    with open('tests/fixtures/html/track_modern.html', 'r') as f:
        html = f.read()

    # Extract data repeatedly, watching for unbounded growth
    for i in range(50):
        result = extract_track_data_from_page(html)
        if i % 10 == 0:
            current = process.memory_info().rss / 1024 / 1024
            print(f"Iteration {i}: {current:.1f} MB (+{current-baseline:.1f} MB)")

    final = process.memory_info().rss / 1024 / 1024
    print(f"Memory growth: {final-baseline:.1f} MB")


🆘 Troubleshooting Guide

Common Issues and Solutions

| Issue | Symptoms | Solution Strategy |
|-------|----------|-------------------|
| JSON Structure Changed | KeyError on expected paths | Research current structure, update navigation paths |
| Missing Lyrics | No lyrics in any test cases | Check if lyrics require authentication, implement fallback |
| Image URLs Invalid | 404 errors on image links | Validate URL format, check different image sizes |
| Performance Issues | Slow extraction (>0.1s) | Profile code, optimize JSON parsing, cache BeautifulSoup |

When to Ask for Help 🤝

Immediately ask me (@AliAkhtari78) if you encounter:

  1. Structural Changes: Spotify's JSON structure differs significantly from fixtures
  2. Authentication Issues: Live URLs return different data than expected
  3. Test Failures: More than 1 test case fails after implementing fixes
  4. Data Quality Issues: Extracted data seems incomplete or incorrect
  5. Performance Problems: Extraction takes longer than 0.1 seconds consistently

Fixes #28.


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.
