API reference#

class scrapyloganalyzer.ScrapyLogFile(name)[source]#

A representation of a Scrapy log file.

classmethod find(logs_directory, source_id, data_version)[source]#

Find and return the first matching log file for the given crawl.

Parameters:
  • logs_directory (str) – Kingfisher Collect’s project directory within Scrapyd’s logs_dir directory

  • source_id (str) – the spider’s name

  • data_version (datetime.datetime) – the crawl directory’s name, parsed as a datetime

Returns:

the first matching log file

Return type:

scrapyloganalyzer.ScrapyLogFile
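The lookup can be pictured as iterating over the spider's log files in Scrapyd's directory layout and returning the first one that matches. `find_log` and its `matches` callback below are illustrative stand-ins for the library's internals, not its actual code:

```python
from pathlib import Path


def find_log(logs_directory, source_id, matches):
    """Return the first log file for which ``matches`` is true, or None.

    Scrapyd stores each job's log as <logs_dir>/<project>/<spider>/<job>.log;
    ``matches`` stands in for ``ScrapyLogFile.match(data_version)``.
    """
    for path in sorted(Path(logs_directory, source_id).glob("*.log")):
        if matches(path):
            return path
    return None
```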

delete()[source]#

Delete the log file and any log summary ending in .stats.
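A minimal sketch of this cleanup, assuming the summary file is named by appending ".stats" to the log's filename (an assumption for illustration):

```python
from pathlib import Path


def delete_with_summary(log_path):
    """Remove a log file and its companion summary, if either exists.

    The ".stats" naming convention here is an assumption for illustration.
    """
    log = Path(log_path)
    log.unlink(missing_ok=True)
    log.with_name(log.name + ".stats").unlink(missing_ok=True)
```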

property logparser#
Returns:

the output of logparser

Return type:

dict

match(data_version)[source]#

Return whether the crawl directory’s name, parsed as a datetime, is less than 3 seconds after the log file’s start time.

Returns:

whether the crawl directory’s name matches the log file’s start time

Return type:

bool
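The 3-second window can be sketched as a plain datetime comparison; whether each bound is inclusive is an assumption here:

```python
from datetime import timedelta


def matches(log_start_time, data_version):
    """True if ``data_version`` is less than 3 seconds after the log's start time."""
    return timedelta(0) <= data_version - log_start_time < timedelta(seconds=3)
```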

property crawl_time#

Return the crawl_time spider argument if set, or the start_time crawl statistic otherwise. If neither is logged, return the time of the first log message.

Returns:

the crawl’s start time

Return type:

datetime.datetime
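The fallback chain described above can be sketched as follows; the string format assumed for the crawl_time spider argument is an illustration, not the library's documented format:

```python
from datetime import datetime


def resolve_crawl_time(spider_arguments, crawl_stats, first_log_time):
    """Apply the fallback chain: crawl_time argument, then the start_time
    crawl statistic, then the time of the first log message."""
    if "crawl_time" in spider_arguments:
        # The ISO-like format is an assumption for illustration.
        return datetime.strptime(spider_arguments["crawl_time"], "%Y-%m-%dT%H:%M:%S")
    if "start_time" in crawl_stats:
        return crawl_stats["start_time"]
    return first_log_time
```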

is_finished()[source]#

Return whether the log file contains a “Spider closed (finished)” log message or a finish_reason crawl statistic set to “finished”.

Returns:

whether the crawl finished cleanly

Return type:

bool
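The two documented conditions reduce to a simple check; this sketch takes the raw log text and parsed crawl statistics as inputs:

```python
def finished_cleanly(log_text, crawl_stats):
    """True if the log contains the closing message or the finish_reason
    crawl statistic is "finished"."""
    return (
        "Spider closed (finished)" in log_text
        or crawl_stats.get("finish_reason") == "finished"
    )
```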

property item_counts#
Returns:

the number of each type of item, according to the log file

Return type:

dict

property spider_arguments#
Returns:

the spider's arguments

Return type:

dict

is_complete()[source]#

Return whether the crawl collected the full dataset, according to the log file.

Returns:

whether the crawl collected the full dataset

Return type:

bool
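One plausible way to detect a subset crawl is to inspect the spider arguments; the indicator arguments below (sample, from_date, until_date) are hypothetical examples, not the library's documented list:

```python
# Hypothetical indicators: spider arguments that restrict a crawl to a subset.
SUBSET_ARGUMENTS = {"sample", "from_date", "until_date"}


def is_subset_crawl(spider_arguments):
    """True if any of the (assumed) subset-indicating arguments were set."""
    return bool(SUBSET_ARGUMENTS & set(spider_arguments))
```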

property error_rate#

Return an estimated lower bound of the true error rate.

Kingfisher Collect is expected to yield at most one FileError item per request leading to a File item, so the true error rate can only be less than this estimated lower bound if Kingfisher Collect breaks this expectation. On the other hand, the true error rate can easily be higher than the estimated lower bound; for example:

  • If the spider crawls 10 URLs, each returning 99 URLs, each returning OCDS data, and the requests for 5 of the 10 fail, then the estimated lower bound is 5 / 500 (1%), though the true error rate is 50%.

  • Similarly if the spider crawls 10 archive files, each containing 99 OCDS files.

Returns:

an estimated lower bound of the true error rate

Return type:

float
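The arithmetic in the first scenario above can be checked directly:

```python
# 10 top-level requests, 5 of which fail. The 5 that succeed each return
# 99 more URLs, so roughly 500 File items are collected while only
# 5 FileError items are logged.
file_errors = 5
files = 500
estimated_lower_bound = file_errors / files  # 0.01, i.e. 1%
true_error_rate = 0.5  # half of the dataset was never reached

assert estimated_lower_bound <= true_error_rate
```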