[GH-ISSUE #33] Feature request: RSS feed crawler #33

Closed
opened 2026-02-27 15:54:34 +03:00 by kerem · 4 comments

Originally created by @jenkin on GitHub (May 11, 2017).
Original GitHub issue: https://github.com/RD17/ambar/issues/33

For feeds whose items contain [enclosures](http://cyber.harvard.edu/rss/rss.html#ltenclosuregtSubelementOfLtitemgt) linking to available documents.


@sochix commented on GitHub (May 28, 2017):

Hi @jenkin. Can you please describe this crawler a bit? Should it fetch all linked documents? And if the document is an HTML page, should it fetch that too?


@jenkin commented on GitHub (May 30, 2017):

The RSS specification requires the MIME type as an attribute of the [enclosure tag](http://cyber.harvard.edu/rss/rss.html#ltenclosuregtSubelementOfLtitemgt): "It has three required attributes: url says where the enclosure is located, length says how big it is in bytes, and type says what its type is, a standard MIME type." So the crawler should fetch only enclosures whose MIME type is supported by Ambar (e.g. application/pdf). Other types, such as mp3 in podcast feeds, will be ignored.
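
A minimal sketch of that filtering rule, using only the Python standard library. The function name `supported_enclosures`, the `SUPPORTED_MIME_TYPES` set, and the sample feed are all hypothetical; the MIME types Ambar actually accepts are not listed in this thread, and this is not Ambar's crawler code.

```python
import xml.etree.ElementTree as ET

# Assumption: illustrative only; the real list of MIME types supported by
# Ambar is not given in this thread.
SUPPORTED_MIME_TYPES = {"application/pdf"}


def supported_enclosures(feed_xml, supported=SUPPORTED_MIME_TYPES):
    """Return (url, mime_type) pairs for enclosures with a supported type."""
    root = ET.fromstring(feed_xml)
    matches = []
    # In RSS 2.0, <enclosure> is a sub-element of <item>.
    for item in root.iter("item"):
        for enclosure in item.findall("enclosure"):
            url = enclosure.get("url")
            mime_type = (enclosure.get("type") or "").strip().lower()
            # Skip enclosures with no URL or an unsupported type (e.g. mp3).
            if url and mime_type in supported:
                matches.append((url, mime_type))
    return matches


# Hypothetical feed: one PDF enclosure and one podcast-style mp3 enclosure.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example</title>
<item><title>Report</title>
<enclosure url="http://example.com/report.pdf" length="1024" type="application/pdf"/></item>
<item><title>Episode</title>
<enclosure url="http://example.com/ep1.mp3" length="2048" type="audio/mpeg"/></item>
</channel></rss>"""

print(supported_enclosures(SAMPLE_FEED))
# [('http://example.com/report.pdf', 'application/pdf')]
```

Filtering on the declared `type` attribute means the crawler can skip unsupported files without downloading them, at the cost of trusting the feed's metadata.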


@sochix commented on GitHub (May 30, 2017):

@jenkin got it! Can you please also describe the use case for the RSS crawler? What kind of data do you want to crawl with it? Newsletters or something else?


@sochix commented on GitHub (Apr 19, 2018):

[Check our support options](https://github.com/RD17/ambar#support)
