[GH-ISSUE #223] Indexing files without extensions #217

Closed
opened 2026-02-27 15:55:41 +03:00 by kerem · 2 comments

Originally created by @tristanolive on GitHub (Mar 7, 2019).
Original GitHub issue: https://github.com/RD17/ambar/issues/223

We have a large set of files without file extensions such as .txt, .png, or .pdf, but the file type is available in the file metadata. Ambar reports the following in the log:

`path/to/file ignoring. Rule: File should have extension`

Is there a configuration option or other method that lets a crawler index these files?

kerem 2026-02-27 15:55:41 +03:00
  • closed this issue
  • added the wontfix label

@stale[bot] commented on GitHub (Mar 22, 2019):

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


@tristanolive commented on GitHub (Mar 29, 2019):

The `file` command on Linux detects file types, which could then be used to determine what processing is necessary without relying on an extension. For example, running `file ... | awk {'print $2 "_" $3'}` gives something like:

  • ASCII_text
  • PDF_document
  • JPEG_image
  • gzip_compressed

Could this be a simple addition where the file type filters are currently in place? I think it would go a long way toward the maturity of this product.
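
The idea in this comment can be sketched in Python. This is a minimal illustration, not Ambar's code: the function names and the `MIME_TO_EXTENSION` table are hypothetical, and it assumes the `file` utility is installed. It asks for `file --brief --mime-type` output rather than parsing the human-readable description with `awk`, since the description format varies and splitting on whitespace breaks on paths containing spaces.

```python
import shutil
import subprocess

# Hypothetical mapping from a detected MIME type to the extension that an
# extension-based filter would expect. Extend as needed.
MIME_TO_EXTENSION = {
    "text/plain": ".txt",
    "application/pdf": ".pdf",
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "application/gzip": ".gz",
    "application/x-gzip": ".gz",  # reported by some older versions of file
}


def extension_for_mime(mime_type):
    """Return the mapped extension for a MIME type, or None if unknown."""
    return MIME_TO_EXTENSION.get(mime_type.strip().lower())


def detect_mime_type(path):
    """Ask file(1) for the MIME type of `path`.

    Returns None when the `file` utility is missing or exits with an error.
    """
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def effective_extension(path):
    """Extension a filter could use for an extensionless file, or None."""
    mime_type = detect_mime_type(path)
    return extension_for_mime(mime_type) if mime_type else None
```

A crawler-side filter could call `effective_extension(path)` only when the file name has no extension and keep ignoring the file when the result is `None`.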
