-
Notifications
You must be signed in to change notification settings - Fork 3
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Check external resources for broken links (#2171)
* create app * implement a basic logic * Update admin.py * fixed "valid is broken" error * fixed checks * WIP: 6301da2 fixed checks and added celeary periodic task * added batching in celery tasks for external resources * fixed priority for external urls task * fixed json serialization issue in celery task for external resource check * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * added conditional task creation and fixed typing * added tests for external resources * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * addded celery task priority configs * added missing docstrings, moved constants to constants.py, logged exceptions and fixed spelling mistakes * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed import * Updated constants.py * added docstrings and updated constants and task frequency * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added readme for external resources * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed spelling mistakes * updated descriptions * added flag to enable/disable celery task for external resources check * updated priority * Added section in readme to enable/disable external resource check task with env variable --------- Co-authored-by: Umar Hassan <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Loading branch information
1 parent
f48bdd2
commit d6d5cd1
Showing
21 changed files
with
728 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# External Resource Availability Workflow | ||
|
||
This document describes the workflow for the external resources validation tasks. | ||
|
||
**SECTIONS** | ||
|
||
1. [Overview](#overview) | ||
1. [Enabling Task](#enabling-task) | ||
1. [Frequency Control](#frequency-control) | ||
1. [Rate Limiting](#rate-limiting) | ||
1. [Task Priority](#task-priority) | ||
|
||
|
||
# Overview | ||
|
||
This assumes that celery beat scheduler is installed and enabled, which is required for the task scheduling. | ||
|
||
Frequency for the task is set to `1/week`. After each week, all external resources, new or existing, will be validated regardless of their last status. | ||
|
||
The high-level description of the process is below, and each subsequent section contains additional details, including links to the relevant code. | ||
|
||
* Task is automatically added in scheduler on system start. | ||
* On execution, all available external resources are retrieved from DB. | ||
* Gathered data is divided into preconfigured batch sizes. | ||
* All batches are grouped into a single celery task and executed. | ||
* Each batch-task iterates over batch to validate availability of each resource and its backup resource if available. | ||
* The status of resource is then added to DB. | ||
* Batch tasks have a preconfigured rate-limiter and lower priority by default. | ||
|
||
|
||
## Enabling Task | ||
The task for external resource checking can be enabled/disabled using the `CHECK_EXTERNAL_RESOURCE_TASK_ENABLE` defined in [here](/main/settings.py). However, once scheduled, the task can be removed only if `Celery` is restarted along with toggling `CHECK_EXTERNAL_RESOURCE_TASK_ENABLE` to `False`. | ||
|
||
## Frequency Control | ||
|
||
The task frequency (in seconds) is set using the `CHECK_EXTERNAL_RESOURCE_STATUS_FREQUENCY` in [here](/main/settings.py). Default value for the frequency is set to `604800 seconds -> 1 week`. | ||
|
||
|
||
## Rate Limiting | ||
|
||
The rate-limit for the external resource batch-tasks is set using `EXTERNAL_RESOURCE_TASK_RATE_LIMIT` in [here](constants.py). The assigned value for the rate-limiter is set to `100/s`. | ||
|
||
|
||
## Task Priority | ||
|
||
Batch-task priority is set using the `EXTERNAL_RESOURCE_TASK_PRIORITY` in [here](constants.py). The default priority for each celery task has been preconfigured to `2` out of range `0(lowest) - 4(highest)`. External resource tasks have lowest (`0`) priority by default. | ||
|
||
Priority levels and celery default task priority can be configured by `PRIORITY_STEPS` and `DEFAULT_PRIORITY`, respectively, in [here](/main/constants.py). |
Empty file.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
"""External Resources Admin""" | ||
|
||
from django.contrib import admin | ||
from mitol.common.admin import TimestampedModelAdmin | ||
|
||
from external_resources.models import ExternalResourceState | ||
|
||
|
||
class ExternalResourceStateAdmin(TimestampedModelAdmin): | ||
"""ExternalResourceState Admin""" | ||
|
||
model = ExternalResourceState | ||
|
||
include_created_on_in_list = True | ||
search_fields = ( | ||
"content__text_id", | ||
"content__title", | ||
"content__website__name", | ||
"content__website__title", | ||
) | ||
list_display = ( | ||
"get_content_title", | ||
"get_content_text_id", | ||
"get_website_name", | ||
) | ||
list_filter = ("status",) | ||
raw_id_fields = ("content",) | ||
ordering = ("-created_on",) | ||
|
||
def get_queryset(self, request): # noqa: ARG002 | ||
"""Return data along with the related WebsiteContent""" | ||
return self.model.objects.get_queryset().select_related("content__website") | ||
|
||
def get_content_title(self, obj): | ||
"""Return the related WebsiteContent title""" | ||
return obj.content.title | ||
|
||
get_content_title.short_description = "Content Title" | ||
get_content_title.admin_order_field = "content__title" | ||
|
||
def get_content_text_id(self, obj): | ||
"""Return the related WebsiteContent text ID""" | ||
return obj.content.text_id | ||
|
||
get_content_text_id.short_description = "Content Text ID" | ||
get_content_text_id.admin_order_field = "content__text_id" | ||
|
||
def get_website_name(self, obj): | ||
"""Return the related Website name""" | ||
return obj.content.website.name | ||
|
||
get_website_name.short_description = "Website" | ||
get_website_name.admin_order_field = "content__website__name" | ||
|
||
|
||
admin.site.register(ExternalResourceState, ExternalResourceStateAdmin) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
"""External Resources API""" | ||
|
||
import logging | ||
from typing import Optional | ||
|
||
import requests | ||
|
||
from external_resources.constants import ( | ||
RESOURCE_BROKEN_STATUS_END, | ||
RESOURCE_BROKEN_STATUS_START, | ||
) | ||
from external_resources.exceptions import CheckFailedError | ||
from websites.models import WebsiteContent | ||
|
||
log = logging.getLogger() | ||
|
||
|
||
def is_url_broken(url: str) -> tuple[bool, Optional[int]]: | ||
"""Check if provided url is broken""" | ||
if url.strip() == "": | ||
return False, None | ||
|
||
log.debug("Making a HEAD request for url: %s", url) | ||
|
||
try: | ||
response = requests.head(url, allow_redirects=True, timeout=30) | ||
except Exception as ex: | ||
log.debug(ex) | ||
raise CheckFailedError from ex | ||
|
||
if ( | ||
response.status_code >= RESOURCE_BROKEN_STATUS_START | ||
and response.status_code < RESOURCE_BROKEN_STATUS_END | ||
): | ||
return True, response.status_code | ||
|
||
return False, response.status_code | ||
|
||
|
||
def is_external_url_broken( | ||
external_resource: WebsiteContent, | ||
) -> tuple[bool, Optional[int]]: | ||
"""Check if external url of the provided WebsiteContent is broken""" | ||
url = external_resource.metadata.get("external_url", "") | ||
return is_url_broken(url) | ||
|
||
|
||
def is_backup_url_broken( | ||
external_resource: WebsiteContent, | ||
) -> tuple[bool, Optional[int]]: | ||
"""Check if backup url of the provided WebsiteContent is broken""" | ||
url = external_resource.metadata.get("backup_url", "") | ||
return is_url_broken(url) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
"""Tests for External Resources API""" | ||
|
||
import pytest | ||
|
||
from external_resources.api import is_url_broken | ||
from external_resources.constants import RESOURCE_UNCHECKED_STATUSES | ||
from external_resources.exceptions import CheckFailedError | ||
|
||
|
||
def test_is_url_broken_valid(mocker): | ||
"""Test for working url""" | ||
mock_response = mocker.Mock(status_code=200) | ||
mocker.patch("external_resources.api.requests.head", return_value=mock_response) | ||
|
||
result, status_code = is_url_broken("http://google.com") | ||
assert not result | ||
assert status_code == 200 | ||
|
||
|
||
@pytest.mark.parametrize("status_code", RESOURCE_UNCHECKED_STATUSES) | ||
def test_is_url_broken_whitelisted(mocker, status_code): | ||
"""Test for broken url""" | ||
mock_response = mocker.Mock(status_code=status_code) | ||
mocker.patch("external_resources.api.requests.head", return_value=mock_response) | ||
|
||
result, response_status_code = is_url_broken("http://google.com/") | ||
assert result | ||
assert response_status_code == status_code | ||
|
||
|
||
def test_is_url_broken_empty(): | ||
"""Test for empty url""" | ||
result, status_code = is_url_broken("") | ||
assert not result | ||
assert status_code is None | ||
|
||
|
||
def test_is_url_broken_exception(mocker): | ||
"""Test for connection error""" | ||
mocker.patch( | ||
"external_resources.api.requests.head", | ||
side_effect=CheckFailedError, | ||
) | ||
|
||
with pytest.raises(CheckFailedError): | ||
is_url_broken("http://google.com") |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
"""External Resources Apps""" | ||
|
||
from django.apps import AppConfig | ||
|
||
|
||
class ExternalResourcesConfig(AppConfig): | ||
"""App for External Resources""" | ||
|
||
default_auto_field = "django.db.models.BigAutoField" | ||
name = "external_resources" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
"""Constants for External Resources module""" | ||
|
||
# HTTP Status Codes | ||
HTTP_BAD_REQUEST = 400 | ||
HTTP_UNAUTHORIZED = 401 | ||
HTTP_PAYMENT_REQUIRED = 402 | ||
HTTP_FORBIDDEN = 403 | ||
HTTP_TOO_MANY_REQUESTS = 429 | ||
HTTP_REQUEST_TIMEOUT = 408 | ||
HTTP_SERVICE_UNAVAILABLE = 503 | ||
|
||
# External Resource | ||
RESOURCE_BROKEN_STATUS_START = HTTP_BAD_REQUEST | ||
RESOURCE_BROKEN_STATUS_END = 600 | ||
RESOURCE_UNCHECKED_STATUSES = [ | ||
HTTP_UNAUTHORIZED, | ||
HTTP_PAYMENT_REQUIRED, | ||
HTTP_FORBIDDEN, | ||
HTTP_TOO_MANY_REQUESTS, | ||
HTTP_REQUEST_TIMEOUT, | ||
HTTP_SERVICE_UNAVAILABLE, | ||
] | ||
|
||
# Celery Task | ||
EXTERNAL_RESOURCE_TASK_RATE_LIMIT = "100/s" | ||
EXTERNAL_RESOURCE_TASK_PRIORITY = 4 # Lowest priority from range (0 - 4) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
"""External Resources Exceptions""" | ||
|
||
|
||
class CheckFailedError(Exception): | ||
"""Check Failed Exception""" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
"""External Resources Factories""" | ||
|
||
import factory | ||
import pytz | ||
|
||
from external_resources.models import ExternalResourceState | ||
from websites.factories import WebsiteContentFactory | ||
|
||
|
||
class ExternalResourceStateFactory(factory.django.DjangoModelFactory): | ||
"""External Resource Factory""" | ||
|
||
class Meta: | ||
"""Meta class for External Resource State Factory""" | ||
|
||
model = ExternalResourceState | ||
|
||
content = factory.SubFactory(WebsiteContentFactory) | ||
status = ExternalResourceState.Status.UNCHECKED | ||
external_url_response_code = factory.Faker("random_int", min=100, max=599) | ||
backup_url_response_code = factory.Faker("random_int", min=100, max=599) | ||
is_external_url_broken = factory.Faker("boolean") | ||
is_backup_url_broken = factory.Faker("boolean") | ||
last_checked = factory.Faker("date_time", tzinfo=pytz.utc) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
# Generated by Django 4.2.11 on 2024-05-02 09:34 | ||
|
||
import django.db.models.deletion | ||
from django.db import migrations, models | ||
|
||
|
||
class Migration(migrations.Migration): | ||
"""Initial Migration for External Resources""" | ||
|
||
initial = True | ||
|
||
dependencies = [ | ||
("websites", "0053_safedelete_deleted_by_cascade"), | ||
] | ||
|
||
operations = [ | ||
migrations.CreateModel( | ||
name="ExternalResourceState", | ||
fields=[ | ||
( | ||
"id", | ||
models.BigAutoField( | ||
auto_created=True, | ||
primary_key=True, | ||
serialize=False, | ||
verbose_name="ID", | ||
), | ||
), | ||
("created_on", models.DateTimeField(auto_now_add=True)), | ||
("updated_on", models.DateTimeField(auto_now=True)), | ||
( | ||
"status", | ||
models.CharField( | ||
choices=[ | ||
("unchecked", "Unchecked or pending check"), | ||
("valid", "Either URL or backup URL is valid"), | ||
("broken", "Both URL and backup URL are broken"), | ||
], | ||
default="unchecked", | ||
help_text="The status of this external resource.", | ||
max_length=16, | ||
), | ||
), | ||
( | ||
"external_url_response_code", | ||
models.IntegerField(blank=True, default=None, null=True), | ||
), | ||
( | ||
"backup_url_response_code", | ||
models.IntegerField(blank=True, default=None, null=True), | ||
), | ||
( | ||
"is_external_url_broken", | ||
models.BooleanField(blank=True, default=None, null=True), | ||
), | ||
( | ||
"is_backup_url_broken", | ||
models.BooleanField(blank=True, default=None, null=True), | ||
), | ||
( | ||
"last_checked", | ||
models.DateTimeField( | ||
blank=True, | ||
default=None, | ||
help_text="The last time when this resource" | ||
" was checked for breakages.", | ||
null=True, | ||
), | ||
), | ||
( | ||
"content", | ||
models.OneToOneField( | ||
on_delete=django.db.models.deletion.CASCADE, | ||
related_name="external_resource_state", | ||
to="websites.websitecontent", | ||
), | ||
), | ||
], | ||
options={ | ||
"abstract": False, | ||
}, | ||
), | ||
] |
Oops, something went wrong.