urlchecker.core.fileproc

Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat

This source code is licensed under the terms of the MIT license. For a copy, see <https://opensource.org/licenses/MIT>.

urlchecker.core.fileproc.check_file_type(file_path: str, file_types: List[str]) → bool

Check the file type to ensure that only files with certain predefined extensions are checked. An entry may currently be either a verbatim extension or a regular expression to match against the filename. For example, .* matches all hidden files, and *.html matches an HTML file.

Args:
  • file_path (str) : path to file.
  • file_types (list) : list of file extensions to accept.
Returns:
(bool) True if the file type is supported, else False.
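
The matching rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the name `matches_file_type` is hypothetical, and the use of `fnmatch` for the pattern case is an assumption chosen because the examples `.*` and `*.html` read as shell-style wildcards.

```python
import fnmatch
import os
from typing import List


def matches_file_type(file_path: str, file_types: List[str]) -> bool:
    """Hypothetical helper: accept a file by verbatim extension or filename pattern."""
    name = os.path.basename(file_path)
    extension = os.path.splitext(name)[1]  # e.g. ".html"; empty for ".gitignore"
    for file_type in file_types:
        # Case 1: the entry is a verbatim extension such as ".md".
        if extension and extension == file_type:
            return True
        # Case 2 (assumption): the entry is a shell-style pattern such as "*.html".
        if fnmatch.fnmatch(name, file_type):
            return True
    return False
```

With this sketch, `matches_file_type("docs/index.html", [".html"])` accepts the file through the extension branch, while `matches_file_type(".gitignore", [".*"])` accepts it through the pattern branch.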

Collect all links in a file.

Args:
  • file_path (str) : path to file.
  • unique (bool) : specify whether to filter out duplicate links.
Returns:
(list) list of links/ urls in a file.
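
As a rough illustration of link collection with an optional uniqueness filter, the sketch below scans text with a deliberately simple URL regular expression. The names `extract_links` and `read_links` and the regular expression itself are assumptions made for this example; the library's own pattern is likely more thorough.

```python
import re
from typing import List

# Simplified URL pattern (assumption): scheme, then anything up to
# whitespace or a common closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_links(text: str, unique: bool = True) -> List[str]:
    """Hypothetical helper: return the URLs found in a string."""
    # Trim trailing sentence punctuation that the pattern may have swallowed.
    links = [match.rstrip(".,;:") for match in URL_PATTERN.findall(text)]
    if unique:
        # dict.fromkeys drops duplicates while keeping first-seen order.
        links = list(dict.fromkeys(links))
    return links


def read_links(file_path: str, unique: bool = True) -> List[str]:
    """Hypothetical helper: read a file and return the URLs it contains."""
    with open(file_path, "r", encoding="utf-8", errors="ignore") as handle:
        return extract_links(handle.read(), unique=unique)
```

Keeping the text scanning separate from the file reading makes the filter easy to exercise without touching the file system.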
urlchecker.core.fileproc.get_file_paths(base_path: str, file_types: List[str], exclude_files: List[str] = None, include_patterns: List[str] = None) → List[str]

Get the paths to all files under a given directory and its subfolders.

Args:
  • base_path (str) : base path.
  • file_types (list) : list of file extensions to accept.
  • include_patterns (list) : list of files and patterns to include.
  • exclude_files (list) : list of files or patterns to exclude.
Returns:
(list) list of file paths.
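
A recursive walk with the three filters could look roughly like the sketch below. The name `list_files` is hypothetical, and the filter semantics (extension or wildcard match, then exclusion, then inclusion, each pattern list joined into one OR regular expression) are assumptions for illustration rather than the library's confirmed behaviour.

```python
import fnmatch
import os
import re
from typing import List, Optional


def list_files(
    base_path: str,
    file_types: List[str],
    exclude_files: Optional[List[str]] = None,
    include_patterns: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: collect matching file paths under base_path."""
    found = []
    for root, _dirs, files in os.walk(base_path):  # descends into subfolders
        for name in files:
            path = os.path.join(root, name)
            extension = os.path.splitext(name)[1]
            # Keep only accepted extensions or filename wildcards.
            if not any(extension == t or fnmatch.fnmatch(name, t) for t in file_types):
                continue
            # Assumption: an exclusion match always drops the file.
            if exclude_files and re.search("|".join(exclude_files), path):
                continue
            # Assumption: when include patterns are given, one must match.
            if include_patterns and not re.search("|".join(include_patterns), path):
                continue
            found.append(path)
    return sorted(found)
```

For example, `list_files(".", [".md"], exclude_files=["README"])` would return the Markdown files below the current directory whose paths do not contain `README`.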
urlchecker.core.fileproc.include_file(file_path: str, exclude_patterns: List[str] = None, include_patterns: List[str] = None) → bool

Check whether a file path should be included, based on an OR regular expression built from the given patterns. The user is currently not notified if a file is marked for removal.

Args:
  • file_path (str) : the file path to check for inclusion.
  • exclude_patterns (list) : list of patterns to exclude.
  • include_patterns (list) : list of patterns to include.
Returns:
(bool) boolean indicating whether the file should be included (True) or excluded and not tested (False).
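
One way to read the "OR regular expression" is that each pattern list is joined with `|` and searched against the path. The sketch below follows that reading. The name `should_include` is hypothetical, and the precedence (exclusion is checked before inclusion) is an assumption rather than documented behaviour.

```python
import re
from typing import List, Optional


def should_include(
    file_path: str,
    exclude_patterns: Optional[List[str]] = None,
    include_patterns: Optional[List[str]] = None,
) -> bool:
    """Hypothetical helper: decide whether a path passes the filters."""
    # No filters at all: every file is included.
    if not exclude_patterns and not include_patterns:
        return True
    # Joining with "|" builds one alternation, so any single pattern may match.
    if exclude_patterns and re.search("|".join(exclude_patterns), file_path):
        return False
    # Assumption: once include patterns are given, a file must match one of them.
    if include_patterns:
        return re.search("|".join(include_patterns), file_path) is not None
    return True
```

Under these assumptions, a path that matches both lists is excluded, because the exclusion check runs first.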
urlchecker.core.fileproc.remove_empty(file_list: List[str]) → List[str]

Given a file list, return only the entries that are neither an empty string nor None.

Args:
  • file_list (list) : a list of files from which to remove None and empty strings.
Returns:
(list) list of the entries that are neither None nor an empty string.
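
The filtering described above amounts to a single list comprehension. The sketch below uses the hypothetical name `drop_empty` and is only an illustration of the behaviour, not the library's code.

```python
from typing import List, Optional


def drop_empty(file_list: List[Optional[str]]) -> List[str]:
    """Hypothetical helper: keep entries that are neither None nor ""."""
    # Compare explicitly so that only None and the empty string are dropped.
    return [entry for entry in file_list if entry is not None and entry != ""]
```

For instance, `drop_empty(["a.md", "", None, "b.md"])` keeps `"a.md"` and `"b.md"` in their original order.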