mirror of
https://github.com/RD17/ambar.git
synced 2026-04-25 15:35:49 +03:00
[GH-ISSUE #33] Feature request: RSS feed crawler
Originally created by @jenkin on GitHub (May 11, 2017).
Original GitHub issue: https://github.com/RD17/ambar/issues/33
The crawler would apply when feed items contain enclosures linking to available documents.
@sochix commented on GitHub (May 28, 2017):
Hi @jenkin. Can you please describe this crawler a bit? Should it fetch all linked documents? And if a document is an HTML page, should it fetch that too?
@jenkin commented on GitHub (May 30, 2017):
The RSS specification requires the MIME type as an attribute of the enclosure tag: "It has three required attributes: url says where the enclosure is located, length says how big it is in bytes, and type says what its type is, a standard MIME type." So the crawler should fetch only enclosures with a MIME type that Ambar supports (e.g. application/pdf). Other types, such as mp3 in podcast feeds, will be ignored.
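A minimal sketch of the filtering described above, assuming Python and the standard-library XML parser. The function name `supported_enclosures`, the `SUPPORTED_MIME_TYPES` allow-list, and the sample feed are all hypothetical illustrations; only `application/pdf` is named in this thread, and none of this is Ambar's actual crawler code.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list: only application/pdf is mentioned in the thread.
SUPPORTED_MIME_TYPES = {"application/pdf"}


def supported_enclosures(rss_xml, supported=SUPPORTED_MIME_TYPES):
    """Return (url, mime_type, length) for each enclosure with a supported type."""
    root = ET.fromstring(rss_xml)
    results = []
    # <enclosure> elements sit inside <item> elements; iter() finds them all.
    for enclosure in root.iter("enclosure"):
        url = enclosure.get("url")
        mime_type = (enclosure.get("type") or "").strip().lower()
        # Skip enclosures without a URL or with an unsupported MIME type.
        if url and mime_type in supported:
            results.append((url, mime_type, enclosure.get("length")))
    return results


# Made-up feed with one PDF enclosure and one podcast-style mp3 enclosure.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Report</title>
      <enclosure url="https://example.com/report.pdf" length="1024" type="application/pdf"/>
    </item>
    <item>
      <title>Podcast episode</title>
      <enclosure url="https://example.com/episode.mp3" length="2048" type="audio/mpeg"/>
    </item>
  </channel>
</rss>"""

print(supported_enclosures(SAMPLE_FEED))
# → [('https://example.com/report.pdf', 'application/pdf', '1024')]
```

The mp3 enclosure is dropped because `audio/mpeg` is not in the allow-list, which matches the podcast case mentioned above. Downloading the matched URLs is left out of the sketch.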
@sochix commented on GitHub (May 30, 2017):
@jenkin got it! Can you please also describe the use case for the RSS crawler? What kind of data do you want to crawl with it? Newsletters, or something else?
@sochix commented on GitHub (Apr 19, 2018):
Check our support options