aparts.src package
Submodules
aparts.src.APT module
aparts.src.construct_keylist module
aparts.src.deduplication module
aparts.src.download_pdf module
aparts.src.extract_references module
aparts.src.graph module
aparts.src.query_expansion module
aparts.src.scholar_record_extraction module
aparts.src.semantic_scholar module
aparts.src.summarization module
aparts.src.weighted_tagging module
- aparts.src.weighted_tagging.clean_end_section(patterns: dict = {'abstract': 'abstract:\\s*(.*?)\\s', 'conclusion': 'conclusion\\s*(.*?)\\s', 'discussion': 'discussion\\s*(.*?)\\s', 'introduction': 'introduction\\s*(.*?)\\s', 'keywords': 'key(?:word| words)(?::|\\s+index:)?\\s*(.*?)\\s', 'methods': '(materials(?:\\s+&)?\\s+)?methods\\s*(.*)', 'references': '(?<!taxonomic )(?:taxonomic\\s)?(?:references cited|references(?!\\s*[A-Z][^a-z]))(?:,(?!$)|.(?!$)|.(?!\\s\\w)|[^.,\\s])(?![^\\s]*\\shttp)\\s*([^.,\\s]+)', 'results': 'results\\s*(.*?)\\s'}, sections: dict = {'abstract': '', 'conclusion': '', 'discussion': '', 'introduction': '', 'keywords': '', 'methods': '', 'references': '', 'results': ''}) → dict
Trims the end of each section by matching it against the beginning of the next section.
Parameters:
patterns (dict): Dictionary of regex header patterns to match
sections (dict): The source dictionary of sections
Returns: sections (dict): Dictionary of corrected sections
- aparts.src.weighted_tagging.count_keyword_occurrences(section_dictionary: dict, keylist: list) → dict
Returns a dictionary of occurrence counts per keyword per section.
Parameters:
section_dictionary (dict): An object containing the input text split into sections.
keylist (list): List of keywords to count
Returns:
word_counts (dict): Dictionary of occurrence counts per keyword per section.
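The counting step described above can be illustrated with a minimal sketch. It is not the package's implementation: the function name `count_keywords_sketch`, the case-insensitive whole-word matching, and the section-then-keyword nesting of the result are all assumptions made for this example.

```python
import re


def count_keywords_sketch(section_dictionary: dict, keylist: list) -> dict:
    """Hypothetical stand-in: count whole-word, case-insensitive matches
    of every keyword in every section."""
    word_counts = {}
    for section, text in section_dictionary.items():
        word_counts[section] = {
            # re.escape keeps keywords containing regex metacharacters literal.
            keyword: len(
                re.findall(r"\b" + re.escape(keyword) + r"\b", text, flags=re.IGNORECASE)
            )
            for keyword in keylist
        }
    return word_counts


sections = {
    "abstract": "Soil carbon and soil moisture were measured.",
    "methods": "Moisture sensors logged data hourly.",
}
counts = count_keywords_sketch(sections, ["soil", "moisture"])
print(counts)
# {'abstract': {'soil': 2, 'moisture': 1}, 'methods': {'soil': 0, 'moisture': 1}}
```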
- aparts.src.weighted_tagging.denest_and_order_dict(dictionary: dict) → dict
Denests a dictionary and orders it in descending order based on the values of the leaf nodes.
Parameters: dictionary (dict): The source dictionary
Returns: dict: The updated dictionary with all nested keys flattened and sorted in descending order based on leaf node values.
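One way to picture the flattening and ordering is the sketch below. The name `denest_and_order_sketch` and the choice to join nested keys with a `.` separator are assumptions for illustration; the documentation does not say how the package names flattened keys.

```python
def denest_and_order_sketch(dictionary: dict, sep: str = ".") -> dict:
    """Hypothetical stand-in: flatten nested keys into 'outer.inner' strings,
    then sort by leaf value, largest first."""
    flat = {}

    def walk(node: dict, prefix: str) -> None:
        for key, value in node.items():
            path = f"{prefix}{sep}{key}" if prefix else str(key)
            if isinstance(value, dict):
                walk(value, path)  # descend until a leaf value is reached
            else:
                flat[path] = value

    walk(dictionary, "")
    # Python dicts preserve insertion order, so the sorted order is kept.
    return dict(sorted(flat.items(), key=lambda item: item[1], reverse=True))


nested = {
    "abstract": {"soil": 8, "moisture": 4},
    "methods": {"soil": 0, "moisture": 6},
}
ordered = denest_and_order_sketch(nested)
print(list(ordered.items()))
# [('abstract.soil', 8), ('methods.moisture', 6), ('abstract.moisture', 4), ('methods.soil', 0)]
```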
- aparts.src.weighted_tagging.extract_sections(text: str, patterns: dict = {'abstract': 'abstract:\\s*(.*?)\\s', 'conclusion': 'conclusion\\s*(.*?)\\s', 'discussion': 'discussion\\s*(.*?)\\s', 'introduction': 'introduction\\s*(.*?)\\s', 'keywords': 'key(?:word| words)(?::|\\s+index:)?\\s*(.*?)\\s', 'methods': '(materials(?:\\s+&)?\\s+)?methods\\s*(.*)', 'references': '(?<!taxonomic )(?:taxonomic\\s)?(?:references cited|references(?!\\s*[A-Z][^a-z]))(?:,(?!$)|.(?!$)|.(?!\\s\\w)|[^.,\\s])(?![^\\s]*\\shttp)\\s*([^.,\\s]+)', 'results': 'results\\s*(.*?)\\s'}, sections: dict = {'abstract': '', 'conclusion': '', 'discussion': '', 'introduction': '', 'keywords': '', 'methods': '', 'references': '', 'results': ''}) → dict
Extracts the sections from the input text and returns a dictionary where each key is a section and the value is the text for that section.
Parameters:
text (str): The source string
patterns (dict): Dictionary of regex headers to match
sections (dict): Empty dictionary of headers
Returns: sections (dict): Dictionary of found sections
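As a rough illustration of header-based splitting, the sketch below locates the first match of each header pattern and takes the text up to the next header as that section's body. It is an assumption about the general technique, not the package's code: `HEADER_PATTERNS` is a simplified, hypothetical pattern set (not the defaults shown in the signature), and `extract_sections_sketch` is a made-up name.

```python
import re

# Hypothetical, simplified header patterns; the package defaults are more involved.
HEADER_PATTERNS = {
    "abstract": r"\babstract\b",
    "introduction": r"\bintroduction\b",
    "methods": r"\bmethods\b",
    "results": r"\bresults\b",
}


def extract_sections_sketch(text: str, patterns: dict = HEADER_PATTERNS) -> dict:
    """Hypothetical stand-in: slice the text between consecutive header matches."""
    starts = []
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            starts.append((match.start(), match.end(), name))
    starts.sort()  # order headers by their position in the text

    sections = {name: "" for name in patterns}  # sections not found stay empty
    for index, (_, body_start, name) in enumerate(starts):
        # A section ends where the next header begins, or at the end of the text.
        body_end = starts[index + 1][0] if index + 1 < len(starts) else len(text)
        sections[name] = text[body_start:body_end].strip()
    return sections


article = "Abstract We study soil. Introduction Soil matters. Methods We sampled plots."
found = extract_sections_sketch(article)
print(found["introduction"])  # Soil matters.
```

Because only the first match of each pattern is used, a header word that appears earlier in running text would be mistaken for the header in this sketch.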
- aparts.src.weighted_tagging.filter_values(keyword_counts: dict, lower: int = 0) → dict
Remove all keys with a value of 0 from a dictionary (nested or not)
Parameters: keyword_counts (dict): The source dictionary
Returns: dict: The updated dictionary with 0 values removed
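A minimal sketch of the documented behaviour (dropping zero-valued keys at any nesting depth) follows. The name `drop_zero_values_sketch` is hypothetical, and the `lower` parameter is deliberately left out because its meaning is not documented here.

```python
def drop_zero_values_sketch(keyword_counts: dict) -> dict:
    """Hypothetical stand-in: remove keys whose value is 0, recursing into
    nested dictionaries."""
    filtered = {}
    for key, value in keyword_counts.items():
        if isinstance(value, dict):
            # Nested dictionaries are kept even if they end up empty (an assumption).
            filtered[key] = drop_zero_values_sketch(value)
        elif value != 0:
            filtered[key] = value
    return filtered


raw_counts = {"abstract": {"soil": 2, "moisture": 0}, "methods": {"soil": 0}, "total": 0}
cleaned = drop_zero_values_sketch(raw_counts)
print(cleaned)  # {'abstract': {'soil': 2}, 'methods': {}}
```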
- aparts.src.weighted_tagging.nested_dict_to_dataframe(nested_dict: Dict[str, Dict[str, int]]) → DataFrame
- aparts.src.weighted_tagging.prepare_bytes_for_pattern(text: str) → str
Removes the artifacts of bytes decoding from a given string.
Parameters: text (str): The string, decoded from a bytes-like object, to prepare.
Returns: str: The prepared string.
- aparts.src.weighted_tagging.print_nested_dict(dictionary: dict, indent=0) → None
Prints a dictionary as key: value lines. If the dictionary is nested, each outer key is printed on its own line, followed by one nested key: nested value line for each entry under that key.
Parameters: dictionary (dict): The dictionary to be printed
Returns: None
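The printing layout described above might look like the following sketch. The name `print_nested_sketch` and the two-space indentation step per nesting level are assumptions; the documentation only states that an `indent` argument exists.

```python
def print_nested_sketch(dictionary: dict, indent: int = 0) -> None:
    """Hypothetical stand-in: print 'key: value' lines, indenting nested entries."""
    for key, value in dictionary.items():
        if isinstance(value, dict):
            print(" " * indent + f"{key}:")
            print_nested_sketch(value, indent + 2)  # two-space step is an assumption
        else:
            print(" " * indent + f"{key}: {value}")


print_nested_sketch({"abstract": {"soil": 2}, "total": 3})
# abstract:
#   soil: 2
# total: 3
```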
- aparts.src.weighted_tagging.save_dataframe(dataframe, folder: str)
Saves the provided dataframe to the provided folder with headers, appending to the file if it already exists.
- aparts.src.weighted_tagging.split_text_to_sections(text: str) → dict
Splits a bytes-like text file into sections based on the headers of a scientific article.
Parameters: text (str): text to be split into sections
Returns: dict (dict:str): Dictionary of the sections
- aparts.src.weighted_tagging.weigh_keywords(nested_dict) → dict
Weighs a nested dictionary by multiplying each value in the last column (the keyword count) by a weight determined by the first column (the section). The weights are as follows: Abstract: 4, Discussion: 3, Methods|Results: 2, Introduction: 1, References: 0
Parameters: nested_dict (dict): The source dictionary
Returns: nested_dict (dict): Dictionary with the weighed values
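The weighting table above can be applied as in this minimal sketch. The name `weigh_keywords_sketch`, the section-then-keyword nesting of the input, and the lowercase section keys are assumptions. Sections that the table does not list (for example conclusion or keywords) receive a weight of 1 here, which is purely an assumption of this example.

```python
# Weights taken from the description above; keys are assumed to be lowercase.
SECTION_WEIGHTS = {
    "abstract": 4,
    "discussion": 3,
    "methods": 2,
    "results": 2,
    "introduction": 1,
    "references": 0,
}


def weigh_keywords_sketch(nested_dict: dict, default_weight: int = 1) -> dict:
    """Hypothetical stand-in: multiply each keyword count by its section weight."""
    return {
        section: {
            # default_weight covers sections missing from the table (an assumption).
            keyword: count * SECTION_WEIGHTS.get(section.lower(), default_weight)
            for keyword, count in keyword_counts.items()
        }
        for section, keyword_counts in nested_dict.items()
    }


weighed = weigh_keywords_sketch(
    {"abstract": {"soil": 2}, "methods": {"soil": 3}, "references": {"soil": 5}}
)
print(weighed)
# {'abstract': {'soil': 8}, 'methods': {'soil': 6}, 'references': {'soil': 0}}
```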