aparts.src package

Submodules

aparts.src.APT module

aparts.src.construct_keylist module

aparts.src.deduplication module

aparts.src.download_pdf module

aparts.src.extract_references module

aparts.src.graph module

aparts.src.query_expansion module

aparts.src.scholar_record_extraction module

aparts.src.semantic_scholar module

aparts.src.summarization module

aparts.src.weighted_tagging module

aparts.src.weighted_tagging.clean_end_section(patterns: dict = {'abstract': 'abstract:\\s*(.*?)\\s', 'conclusion': 'conclusion\\s*(.*?)\\s', 'discussion': 'discussion\\s*(.*?)\\s', 'introduction': 'introduction\\s*(.*?)\\s', 'keywords': 'key(?:word| words)(?::|\\s+index:)?\\s*(.*?)\\s', 'methods': '(materials(?:\\s+&)?\\s+)?methods\\s*(.*)', 'references': '(?<!taxonomic )(?:taxonomic\\s)?(?:references cited|references(?!\\s*[A-Z][^a-z]))(?:,(?!$)|.(?!$)|.(?!\\s\\w)|[^.,\\s])(?![^\\s]*\\shttp)\\s*([^.,\\s]+)', 'results': 'results\\s*(.*?)\\s'}, sections: dict = {'abstract': '', 'conclusion': '', 'discussion': '', 'introduction': '', 'keywords': '', 'methods': '', 'references': '', 'results': ''}) -> dict

Trims the end of each section by matching it against the beginning of the next section.

Parameters:

patterns (dict): Dictionary of regex header patterns to match

sections (dict): The source dictionary of sections

Returns: sections (dict): Dictionary of corrected sections
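
As a rough illustration of the trimming idea, the sketch below cuts each section's text at the first place where another section's header appears. It is not the package's implementation: `trim_sections_sketch` and the simplified `HEADERS` patterns are hypothetical names introduced here, and the real default patterns are the more elaborate ones shown in the signature.

```python
import re

# Hypothetical, simplified header patterns (assumption); the package's
# default patterns are more elaborate.
HEADERS = {
    "abstract": r"\babstract\b",
    "introduction": r"\bintroduction\b",
    "methods": r"\bmethods\b",
}


def trim_sections_sketch(sections: dict, headers: dict = HEADERS) -> dict:
    """Cut each section's text at the first header of any other section."""
    trimmed = {}
    for name, text in sections.items():
        cut = len(text)
        for other, pattern in headers.items():
            if other == name:
                continue  # a section is never trimmed at its own header
            match = re.search(pattern, text, flags=re.IGNORECASE)
            if match:
                cut = min(cut, match.start())
        trimmed[name] = text[:cut].strip()
    return trimmed


sections = {
    "abstract": "We study bees. Introduction Bees are insects. Methods We counted.",
    "introduction": "Bees are insects. Methods We counted.",
    "methods": "We counted.",
}
result = trim_sections_sketch(sections)
print(result["abstract"])
```

A header word that also occurs in running text would trigger an early cut in this sketch, which is presumably why the default patterns are stricter.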

aparts.src.weighted_tagging.count_keyword_occurrences(section_dictionary: dict, keylist: list) -> dict

Returns a dictionary of occurrence counts per keyword per section.

Parameters:

section_dictionary (dict): An object containing the input text split into sections.

keylist (list): List of keywords to count

Returns:

word_counts (dict): Dictionary of occurrence counts per keyword per section.
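
A minimal sketch of per-section keyword counting follows. It assumes whole-word, case-insensitive matching and a result keyed first by section and then by keyword; neither detail is confirmed by the description above, and `count_keywords_sketch` is a hypothetical name.

```python
import re


def count_keywords_sketch(section_dictionary: dict, keylist: list) -> dict:
    """Count whole-word, case-insensitive keyword hits in each section."""
    counts = {}
    for section, text in section_dictionary.items():
        lowered = text.lower()
        counts[section] = {
            keyword: len(
                re.findall(r"\b" + re.escape(keyword.lower()) + r"\b", lowered)
            )
            for keyword in keylist
        }
    return counts


sections = {
    "abstract": "Bees pollinate. Bees forage.",
    "methods": "We counted bees and wasps.",
}
word_counts = count_keywords_sketch(sections, ["bees", "wasps"])
print(word_counts)
```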

aparts.src.weighted_tagging.denest_and_order_dict(dictionary: dict) -> dict

Denests a dictionary and orders it in descending order based on the values of the leaf nodes.

Parameters: dictionary (dict): The source dictionary

Returns: dict: The updated dictionary with all nested keys flattened and sorted in descending order based on leaf node values.
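
The flattening and ordering can be sketched as below. How the package names the flattened keys is not stated, so joining the key path with a dot is an assumption, and `flatten_and_sort_sketch` is a hypothetical name.

```python
def flatten_and_sort_sketch(dictionary: dict) -> dict:
    """Flatten nested keys into dotted paths and sort by value, descending."""
    flat = {}

    def walk(node: dict, prefix: str) -> None:
        for key, value in node.items():
            # Assumption: nested keys are joined with "." to form one flat key.
            path = f"{prefix}.{key}" if prefix else str(key)
            if isinstance(value, dict):
                walk(value, path)
            else:
                flat[path] = value

    walk(dictionary, "")
    # Dictionaries preserve insertion order, so the sorted order is kept.
    return dict(sorted(flat.items(), key=lambda item: item[1], reverse=True))


nested = {
    "abstract": {"bees": 8, "wasps": 0},
    "methods": {"bees": 2, "wasps": 3},
}
ordered = flatten_and_sort_sketch(nested)
print(ordered)
```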

aparts.src.weighted_tagging.extract_sections(text: str, patterns: dict = {'abstract': 'abstract:\\s*(.*?)\\s', 'conclusion': 'conclusion\\s*(.*?)\\s', 'discussion': 'discussion\\s*(.*?)\\s', 'introduction': 'introduction\\s*(.*?)\\s', 'keywords': 'key(?:word| words)(?::|\\s+index:)?\\s*(.*?)\\s', 'methods': '(materials(?:\\s+&)?\\s+)?methods\\s*(.*)', 'references': '(?<!taxonomic )(?:taxonomic\\s)?(?:references cited|references(?!\\s*[A-Z][^a-z]))(?:,(?!$)|.(?!$)|.(?!\\s\\w)|[^.,\\s])(?![^\\s]*\\shttp)\\s*([^.,\\s]+)', 'results': 'results\\s*(.*?)\\s'}, sections: dict = {'abstract': '', 'conclusion': '', 'discussion': '', 'introduction': '', 'keywords': '', 'methods': '', 'references': '', 'results': ''}) -> dict

Extracts the sections from the input text and returns a dictionary where each key is a section and the value is the text for that section.

Parameters:

text (str): The source string

patterns (dict): Dictionary of regex headers to match

sections (dict): Empty dictionary of headers

Returns: sections (dict): Dictionary of found sections
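
One way to picture header-based extraction is to locate each header, order the hits by position, and take the text between consecutive headers. The sketch below does that with simplified, hypothetical patterns (`HEADER_PATTERNS`) and a hypothetical function name; it is an illustration under those assumptions, not the package's code.

```python
import re

# Hypothetical, simplified header patterns (assumption).
HEADER_PATTERNS = {
    "abstract": r"\babstract\b:?",
    "introduction": r"\bintroduction\b",
    "methods": r"\bmethods\b",
}


def extract_sections_sketch(text: str, patterns: dict = HEADER_PATTERNS) -> dict:
    """Return the text between each matched header and the next one."""
    found = []
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            found.append((match.start(), match.end(), name))
    found.sort()  # order headers by their position in the text

    # Start from an empty entry per header; unmatched headers stay empty.
    sections = {name: "" for name in patterns}
    for index, (_, end, name) in enumerate(found):
        is_last = index + 1 == len(found)
        next_start = len(text) if is_last else found[index + 1][0]
        sections[name] = text[end:next_start].strip()
    return sections


text = "Abstract: We study bees. Introduction Bees are insects. Methods We counted."
found_sections = extract_sections_sketch(text)
print(found_sections)
```

The sketch builds a fresh result dictionary on every call instead of reusing a shared default argument, so results from one document cannot leak into the next.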

aparts.src.weighted_tagging.filter_values(keyword_counts: dict, lower: int = 0) -> dict

Removes all keys with a value of 0 from a dictionary, nested or not.

Parameters: keyword_counts (dict): The source dictionary

Returns: dict: The updated dictionary with 0 values removed
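
A recursive sketch of the zero-value filter is shown below. The `lower` parameter is left out because its behaviour is not described here, and dropping nested dictionaries that end up empty is an assumption; `drop_zero_values_sketch` is a hypothetical name.

```python
def drop_zero_values_sketch(keyword_counts: dict) -> dict:
    """Return a copy without zero-valued leaves, at any nesting depth."""
    filtered = {}
    for key, value in keyword_counts.items():
        if isinstance(value, dict):
            nested = drop_zero_values_sketch(value)
            # Assumption: a nested dictionary left empty is dropped as well.
            if nested:
                filtered[key] = nested
        elif value != 0:
            filtered[key] = value
    return filtered


counts = {
    "abstract": {"bees": 2, "wasps": 0},
    "references": {"bees": 0},
}
nonzero = drop_zero_values_sketch(counts)
print(nonzero)
```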

aparts.src.weighted_tagging.nested_dict_to_dataframe(nested_dict: Dict[str, Dict[str, int]]) -> DataFrame

aparts.src.weighted_tagging.open_file(file: str) -> str

aparts.src.weighted_tagging.prepare_bytes_for_pattern(text: str) -> str

Removes the artifacts of bytes decoding from a given string.

Parameters: text (str): The string, converted from a bytes-like object, to prepare.

Returns: str: The prepared string.
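
Which artifacts the function removes is not spelled out above. The sketch below assumes they are the ones left when a bytes object is converted with `str()`: a `b'...'` wrapper and escaped whitespace such as a literal `\n`. The function name is hypothetical and the cleanup rules are illustrative only.

```python
def strip_bytes_artifacts_sketch(text: str) -> str:
    """Remove a b'...' wrapper and escaped whitespace left by str(bytes)."""
    # Assumption: the input may look like "b'...'" or 'b"..."'.
    if len(text) >= 3 and text[0] == "b" and text[1] in "'\"" and text[-1] == text[1]:
        text = text[2:-1]
    # Replace two-character escape sequences (backslash + letter) with spaces.
    for escaped in ("\\r\\n", "\\n", "\\r", "\\t"):
        text = text.replace(escaped, " ")
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


raw = str(b"Abstract:\nWe study bees.\tMethods")
prepared = strip_bytes_artifacts_sketch(raw)
print(prepared)
```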

aparts.src.weighted_tagging.print_nested_dict(dictionary: dict, indent=0) -> None

Prints a dictionary as key: value lines. For a nested dictionary, it prints a line for each key, followed by one nested key: nested value line per entry under that key.

Parameters: dictionary (dict): The dictionary to be printed

Returns: None
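
The printing behaviour described above can be sketched with a small recursive helper. The two-space indentation step and the split into a formatting function plus a printing function are assumptions made for this illustration; both function names are hypothetical.

```python
def format_nested_sketch(dictionary: dict, indent: int = 0) -> list:
    """Build the output lines for a possibly nested dictionary."""
    lines = []
    for key, value in dictionary.items():
        if isinstance(value, dict):
            lines.append(" " * indent + f"{key}:")
            # Assumption: nested entries are indented two more spaces.
            lines.extend(format_nested_sketch(value, indent + 2))
        else:
            lines.append(" " * indent + f"{key}: {value}")
    return lines


def print_nested_sketch(dictionary: dict, indent: int = 0) -> None:
    """Print the formatted lines, one per row."""
    print("\n".join(format_nested_sketch(dictionary, indent)))


example = {"abstract": {"bees": 8}, "total": 10}
print_nested_sketch(example)
```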

aparts.src.weighted_tagging.remove_typographic_line_breaks(text)

aparts.src.weighted_tagging.save_dataframe(dataframe, folder: str)

Saves the provided dataframe, with headers, in the provided folder, appending to the file if it is already present.

aparts.src.weighted_tagging.split_text_to_sections(text: str) -> dict

Splits a bytes-like text file into sections based on the headers of a scientific article.

Parameters: text (str): text to be split into sections

Returns: dict: Dictionary of the sections

aparts.src.weighted_tagging.weigh_keywords(nested_dict) -> dict

Weighs a nested dictionary by multiplying each value in the last column by a factor determined by the section named in the first column. The weights are: Abstract: 4, Discussion: 3, Methods and Results: 2, Introduction: 1, References: 0.

Parameters: nested_dict (dict): The source dictionary

Returns: nested_dict (dict): Dictionary with the weighed values
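
The weighting scheme listed above can be sketched as a lookup table applied to each section's counts. The section-then-keyword layout of the input is an assumption, as is leaving unlisted sections (such as conclusion or keywords) at a factor of 1; `weigh_counts_sketch` and `SECTION_WEIGHTS` are hypothetical names.

```python
# Weights as listed in the description above.
SECTION_WEIGHTS = {
    "abstract": 4,
    "discussion": 3,
    "methods": 2,
    "results": 2,
    "introduction": 1,
    "references": 0,
}


def weigh_counts_sketch(nested_dict: dict, default_weight: int = 1) -> dict:
    """Multiply every keyword count by the weight of its section."""
    weighted = {}
    for section, counts in nested_dict.items():
        # Assumption: sections without a listed weight keep their raw counts.
        factor = SECTION_WEIGHTS.get(section.lower(), default_weight)
        weighted[section] = {kw: count * factor for kw, count in counts.items()}
    return weighted


counts = {
    "abstract": {"bees": 2},
    "methods": {"bees": 1, "wasps": 1},
    "references": {"bees": 5},
}
weighted = weigh_counts_sketch(counts)
print(weighted)
```

With these inputs, two abstract mentions outweigh five mentions in the reference list, which is the effect the weights are designed to produce.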

Module contents