Skip to content

Commit

Permalink
Check external resources for broken links (#2171)
Browse files Browse the repository at this point in the history
* create app

* implement a basic logic

* Update admin.py

* fixed "valid is broken" error

* fixed checks

* WIP: 6301da2 fixed checks and added celeary periodic task

* added batching in celery tasks for external  resources

* fixed priority for external urls task

* fixed json serialization issue in celery task for external resource check

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added conditional task creation and fixed typing

* added tests for external resources

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* addded celery task priority configs

* added missing docstrings, moved constants to constants.py, logged exceptions and fixed spelling mistakes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed import

* Updated constants.py

* added docstrings and updated constants and task frequency

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added readme for external resources

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed spelling mistakes

* updated descriptions

* added flag to enable/disable celery task for external resources check

* updated priority

* Added section in readme to enable/disable external resource check task with env variable

---------

Co-authored-by: Umar Hassan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  • Loading branch information
3 people authored May 28, 2024
1 parent f48bdd2 commit d6d5cd1
Show file tree
Hide file tree
Showing 21 changed files with 728 additions and 0 deletions.
4 changes: 4 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -498,3 +498,7 @@ The following environment variables need to be defined in your .env file in orde
OPEN_CATALOG_URLS=delimited list of api endpoint urls that webhooks should be sent to
OPEN_CATALOG_WEBHOOK_KEY=secret key that will be used to confirm that webhook requests are legitimate
```

# Checking External Resource Availability

This feature sets up a cron job to validate external resource urls. The workflow for checking external resource availability is described [here](/external_resources/README.md).
48 changes: 48 additions & 0 deletions external_resources/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
# External Resource Availability Workflow

This document describes the workflow for the external resources validation tasks.

**SECTIONS**

1. [Overview](#overview)
1. [Enabling Task](#enabling-task)
1. [Frequency Control](#frequency-control)
1. [Rate Limiting](#rate-limiting)
1. [Task Priority](#task-priority)


# Overview

This assumes that celery beat scheduler is installed and enabled, which is required for the task scheduling.

Frequency for the task is set to `1/week`. After each week, all external resources, new or existing, will be validated regardless of their last status.

The high-level description of the process is below, and each subsequent section contains additional details, including links to the relevant code.

* Task is automatically added in scheduler on system start.
* On execution, all available external resources are retrieved from DB.
* Gathered data is divided into preconfigured batch sizes.
* All batches are grouped into a single celery task and executed.
* Each batch-task iterates over batch to validate availability of each resource and its backup resource if available.
* The status of resource is then added to DB.
* Batch tasks have a preconfigured rate-limiter and lower priority by default.


## Enabling Task
The task for external resource checking can be enabled/disabled using the `CHECK_EXTERNAL_RESOURCE_TASK_ENABLE` defined in [here](/main/settings.py). However, once scheduled, the task can be removed only if `Celery` is restarted along with toggling `CHECK_EXTERNAL_RESOURCE_TASK_ENABLE` to `False`.

## Frequency Control

The task frequency (in seconds) is set using the `CHECK_EXTERNAL_RESOURCE_STATUS_FREQUENCY` in [here](/main/settings.py). Default value for the frequency is set to `604800 seconds -> 1 week`.


## Rate Limiting

The rate-limit for the external resource batch-tasks is set using `EXTERNAL_RESOURCE_TASK_RATE_LIMIT` in [here](constants.py). The assigned value for the rate-limiter is set to `100/s`.


## Task Priority

Batch-task priority is set using the `EXTERNAL_RESOURCE_TASK_PRIORITY` in [here](constants.py). The default priority for each celery task has been preconfigured to `2` out of range `0(lowest) - 4(highest)`. External resource tasks have lowest (`0`) priority by default.

Priority levels and celery default task priority can be configured by `PRIORITY_STEPS` and `DEFAULT_PRIORITY`, respectively, in [here](/main/constants.py).
Empty file added external_resources/__init__.py
Empty file.
56 changes: 56 additions & 0 deletions external_resources/admin.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
"""External Resources Admin"""

from django.contrib import admin
from mitol.common.admin import TimestampedModelAdmin

from external_resources.models import ExternalResourceState


class ExternalResourceStateAdmin(TimestampedModelAdmin):
"""ExternalResourceState Admin"""

model = ExternalResourceState

include_created_on_in_list = True
search_fields = (
"content__text_id",
"content__title",
"content__website__name",
"content__website__title",
)
list_display = (
"get_content_title",
"get_content_text_id",
"get_website_name",
)
list_filter = ("status",)
raw_id_fields = ("content",)
ordering = ("-created_on",)

def get_queryset(self, request): # noqa: ARG002
"""Return data along with the related WebsiteContent"""
return self.model.objects.get_queryset().select_related("content__website")

def get_content_title(self, obj):
"""Return the related WebsiteContent title"""
return obj.content.title

get_content_title.short_description = "Content Title"
get_content_title.admin_order_field = "content__title"

def get_content_text_id(self, obj):
"""Return the related WebsiteContent text ID"""
return obj.content.text_id

get_content_text_id.short_description = "Content Text ID"
get_content_text_id.admin_order_field = "content__text_id"

def get_website_name(self, obj):
"""Return the related Website name"""
return obj.content.website.name

get_website_name.short_description = "Website"
get_website_name.admin_order_field = "content__website__name"


admin.site.register(ExternalResourceState, ExternalResourceStateAdmin)
53 changes: 53 additions & 0 deletions external_resources/api.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
"""External Resources API"""

import logging
from typing import Optional

import requests

from external_resources.constants import (
RESOURCE_BROKEN_STATUS_END,
RESOURCE_BROKEN_STATUS_START,
)
from external_resources.exceptions import CheckFailedError
from websites.models import WebsiteContent

log = logging.getLogger()


def is_url_broken(url: str) -> tuple[bool, Optional[int]]:
"""Check if provided url is broken"""
if url.strip() == "":
return False, None

log.debug("Making a HEAD request for url: %s", url)

try:
response = requests.head(url, allow_redirects=True, timeout=30)
except Exception as ex:
log.debug(ex)
raise CheckFailedError from ex

if (
response.status_code >= RESOURCE_BROKEN_STATUS_START
and response.status_code < RESOURCE_BROKEN_STATUS_END
):
return True, response.status_code

return False, response.status_code


def is_external_url_broken(
external_resource: WebsiteContent,
) -> tuple[bool, Optional[int]]:
"""Check if external url of the provided WebsiteContent is broken"""
url = external_resource.metadata.get("external_url", "")
return is_url_broken(url)


def is_backup_url_broken(
external_resource: WebsiteContent,
) -> tuple[bool, Optional[int]]:
"""Check if backup url of the provided WebsiteContent is broken"""
url = external_resource.metadata.get("backup_url", "")
return is_url_broken(url)
46 changes: 46 additions & 0 deletions external_resources/api_test.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
"""Tests for External Resources API"""

import pytest

from external_resources.api import is_url_broken
from external_resources.constants import RESOURCE_UNCHECKED_STATUSES
from external_resources.exceptions import CheckFailedError


def test_is_url_broken_valid(mocker):
"""Test for working url"""
mock_response = mocker.Mock(status_code=200)
mocker.patch("external_resources.api.requests.head", return_value=mock_response)

result, status_code = is_url_broken("http://google.com")
assert not result
assert status_code == 200


@pytest.mark.parametrize("status_code", RESOURCE_UNCHECKED_STATUSES)
def test_is_url_broken_whitelisted(mocker, status_code):
"""Test for broken url"""
mock_response = mocker.Mock(status_code=status_code)
mocker.patch("external_resources.api.requests.head", return_value=mock_response)

result, response_status_code = is_url_broken("http://google.com/")
assert result
assert response_status_code == status_code


def test_is_url_broken_empty():
"""Test for empty url"""
result, status_code = is_url_broken("")
assert not result
assert status_code is None


def test_is_url_broken_exception(mocker):
"""Test for connection error"""
mocker.patch(
"external_resources.api.requests.head",
side_effect=CheckFailedError,
)

with pytest.raises(CheckFailedError):
is_url_broken("http://google.com")
10 changes: 10 additions & 0 deletions external_resources/apps.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
"""External Resources Apps"""

from django.apps import AppConfig


class ExternalResourcesConfig(AppConfig):
"""App for External Resources"""

default_auto_field = "django.db.models.BigAutoField"
name = "external_resources"
26 changes: 26 additions & 0 deletions external_resources/constants.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
"""Constants for External Resources module"""

# HTTP Status Codes
HTTP_BAD_REQUEST = 400
HTTP_UNAUTHORIZED = 401
HTTP_PAYMENT_REQUIRED = 402
HTTP_FORBIDDEN = 403
HTTP_TOO_MANY_REQUESTS = 429
HTTP_REQUEST_TIMEOUT = 408
HTTP_SERVICE_UNAVAILABLE = 503

# External Resource
RESOURCE_BROKEN_STATUS_START = HTTP_BAD_REQUEST
RESOURCE_BROKEN_STATUS_END = 600
RESOURCE_UNCHECKED_STATUSES = [
HTTP_UNAUTHORIZED,
HTTP_PAYMENT_REQUIRED,
HTTP_FORBIDDEN,
HTTP_TOO_MANY_REQUESTS,
HTTP_REQUEST_TIMEOUT,
HTTP_SERVICE_UNAVAILABLE,
]

# Celery Task
EXTERNAL_RESOURCE_TASK_RATE_LIMIT = "100/s"
EXTERNAL_RESOURCE_TASK_PRIORITY = 4 # Lowest priority from range (0 - 4)
5 changes: 5 additions & 0 deletions external_resources/exceptions.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
"""External Resources Exceptions"""


class CheckFailedError(Exception):
"""Check Failed Exception"""
24 changes: 24 additions & 0 deletions external_resources/factories.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
"""External Resources Factories"""

import factory
import pytz

from external_resources.models import ExternalResourceState
from websites.factories import WebsiteContentFactory


class ExternalResourceStateFactory(factory.django.DjangoModelFactory):
"""External Resource Factory"""

class Meta:
"""Meta class for External Resource State Factory"""

model = ExternalResourceState

content = factory.SubFactory(WebsiteContentFactory)
status = ExternalResourceState.Status.UNCHECKED
external_url_response_code = factory.Faker("random_int", min=100, max=599)
backup_url_response_code = factory.Faker("random_int", min=100, max=599)
is_external_url_broken = factory.Faker("boolean")
is_backup_url_broken = factory.Faker("boolean")
last_checked = factory.Faker("date_time", tzinfo=pytz.utc)
83 changes: 83 additions & 0 deletions external_resources/migrations/0001_initial.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
# Generated by Django 4.2.11 on 2024-05-02 09:34

import django.db.models.deletion
from django.db import migrations, models


class Migration(migrations.Migration):
"""Initial Migration for External Resources"""

initial = True

dependencies = [
("websites", "0053_safedelete_deleted_by_cascade"),
]

operations = [
migrations.CreateModel(
name="ExternalResourceState",
fields=[
(
"id",
models.BigAutoField(
auto_created=True,
primary_key=True,
serialize=False,
verbose_name="ID",
),
),
("created_on", models.DateTimeField(auto_now_add=True)),
("updated_on", models.DateTimeField(auto_now=True)),
(
"status",
models.CharField(
choices=[
("unchecked", "Unchecked or pending check"),
("valid", "Either URL or backup URL is valid"),
("broken", "Both URL and backup URL are broken"),
],
default="unchecked",
help_text="The status of this external resource.",
max_length=16,
),
),
(
"external_url_response_code",
models.IntegerField(blank=True, default=None, null=True),
),
(
"backup_url_response_code",
models.IntegerField(blank=True, default=None, null=True),
),
(
"is_external_url_broken",
models.BooleanField(blank=True, default=None, null=True),
),
(
"is_backup_url_broken",
models.BooleanField(blank=True, default=None, null=True),
),
(
"last_checked",
models.DateTimeField(
blank=True,
default=None,
help_text="The last time when this resource"
" was checked for breakages.",
null=True,
),
),
(
"content",
models.OneToOneField(
on_delete=django.db.models.deletion.CASCADE,
related_name="external_resource_state",
to="websites.websitecontent",
),
),
],
options={
"abstract": False,
},
),
]
Loading

0 comments on commit d6d5cd1

Please sign in to comment.