Internal design notes for contributors and AI-assisted development.
User-facing documentation is in README.md and the pkgdown site.

Purpose

This document is intended for AI coding agents (Cursor, Claude Code workflows, ECC-style systems) and human collaborators building the sharkipediaR package.

The package goal is to provide a robust, reproducible, tidyverse-oriented R interface to publicly accessible Sharkipedia data.

The package should:

scrape species-level information from Sharkipedia;
expose trait and trend data through stable R functions;
support ecological and fisheries workflows;
integrate naturally with tidyverse pipelines;
prioritise reproducibility and scientific usability;
be scalable for future API or backend changes;
support parallel retrieval safely and politely.

High-Level Philosophy

This package is NOT:

a bulk mirroring tool;
an aggressive crawler;
a browser automation project;
a Selenium project;
a JS rendering framework.

This package IS:

a lightweight R client;
a tidy data interface;
a scientific reproducibility tool;
an ecology/fisheries workflow package;
a structured HTML parsing system.

The Sharkipedia website appears to be primarily server-rendered HTML served by a Ruby on Rails backend.

Current evidence suggests:

species pages contain directly embedded HTML tables;
extensive JS execution is not required;
rvest parsing is sufficient for most workflows.

Avoid unnecessary complexity.

Primary User Personas

Marine ecologists

Need:

trait datasets;
growth parameters;
fisheries trends;
reproducible extraction pipelines.

Quantitative fisheries scientists

Need:

structured biological traits;
trend retrieval;
batch extraction;
reproducible metadata.

Comparative ecology researchers

Need:

species-level extraction;
taxonomic metadata;
tidy trait tables.

Package developers

Need:

stable interfaces;
composable functions;
clean return objects.

Core Design Principles

1. Tidy-first

All user-facing outputs should return:

tibble objects;
long-form tidy structures where possible;
consistent column naming.

Prefer:

snake_case;
explicit typing;
explicit missing values.

2. Functional architecture

Separate:

retrieval;
parsing;
cleaning;
validation;
formatting.

Avoid monolithic functions.

3. Robustness over cleverness

Website structure may evolve.

Prefer:

semantic selectors;
defensive parsing;
validation checks;
explicit warnings.

Avoid brittle positional selectors.

4. Polite scraping

The package must:

rate limit requests;
support caching;
minimise repeated downloads;
identify itself responsibly.

Never aggressively parallelise.

5. Scientific reproducibility

Every output should ideally include:

source URL;
retrieval timestamp;
species name;
citation/reference fields where available.
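A minimal sketch of how provenance could be attached to any parsed table. The helper name `add_provenance()` and the example.org URL are hypothetical, and the column names follow the expected `sp_traits()` columns listed in these notes.

```r
library(tibble)

# Hypothetical helper: append provenance columns to a parsed table.
add_provenance <- function(data, species, source_url) {
  stamp <- Sys.time()
  attr(stamp, "tzone") <- "UTC"  # report timestamps in UTC
  add_column(
    data,
    species = species,
    source_url = source_url,
    retrieved_at = stamp
  )
}

# Illustrative values only, not real Sharkipedia data.
parsed <- tibble(trait_name = c("Linf", "k"), value = c(120.5, 0.21))
with_provenance <- add_provenance(
  parsed,
  species = "Aetobatus narinari",
  source_url = "https://example.org/species/aetobatus-narinari"
)
```

Storing the timestamp in UTC keeps outputs comparable across collaborators in different time zones.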

Recommended Package Stack

Core tidyverse

Required:

dplyr
purrr
tidyr
tibble
stringr
readr

Scraping

Primary:

rvest
xml2
httr2

Avoid RCurl unless absolutely required.

Parallelisation

furrr
future
progressr

Parallelisation must be optional and conservative.

Reliability

ratelimitr
memoise
cli
rlang

Package infrastructure

usethis
devtools
roxygen2
testthat
pkgdown
lintr
styler

Suggested Package Structure

sharkipediaR/
├── R/
├── man/
├── tests/
├── vignettes/
├── data-raw/
├── inst/
├── DESCRIPTION
├── NAMESPACE
├── README.md
├── LICENSE
└── _pkgdown.yml

Initial User-Facing API

Discovery functions

`sp_search()`

Search species names.

Inputs:

partial species names;
common names;
taxonomic groups.

Returns:

tibble of matched species;
URLs;
taxonomy metadata.

`sp_species_urls()`

Retrieve all known species URLs.

Should:

scrape species index pages;
deduplicate results;
cache outputs.
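The link extraction and deduplication step might look like the sketch below. It assumes species pages live under a `/species/` path, which has not been confirmed against the live site. `extract_species_urls()` and the example.org base URL are hypothetical.

```r
library(rvest)

# Assumption: species links match "/species/<slug>"; adjust once routes are confirmed.
extract_species_urls <- function(doc, base_url, pattern = "/species/[^/?#]+$") {
  hrefs <- html_attr(html_elements(doc, "a[href]"), "href")
  hrefs <- hrefs[grepl(pattern, hrefs)]
  # Resolve relative links, then drop duplicates.
  sort(unique(xml2::url_absolute(hrefs, base_url)))
}

# Inline stand-in for a species index page.
index_html <- minimal_html('
  <a href="/species/aetobatus-narinari">Aetobatus narinari</a>
  <a href="/species/aetobatus-narinari">duplicate link</a>
  <a href="/species/carcharodon-carcharias">Carcharodon carcharias</a>
  <a href="/about">About</a>
')
species_urls <- extract_species_urls(index_html, "https://example.org")
```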

Species-level extraction

`sp_species()`

Retrieve:

taxonomy;
metadata;
external references.

Return one-row tibble.

`sp_traits()`

Primary trait extraction function.

Inputs:

species URL;
scientific name;
vector of species.

Outputs:

Long-form tibble.

Expected columns:

species;
trait_name;
value;
standard;
value_type;
sex;
location;
reference;
source_url;
retrieved_at.
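One way to pin the column contract is a zero-row prototype that parsers and tests can compare against. The column types below are assumptions (for example, `value` may need to stay character if some traits turn out to be non-numeric), and `traits_prototype()` is a hypothetical helper.

```r
library(tibble)

# Hypothetical zero-row prototype of the sp_traits() return value.
traits_prototype <- function() {
  tibble(
    species = character(),
    trait_name = character(),
    value = double(),  # assumption: trait values are numeric
    standard = character(),
    value_type = character(),
    sex = character(),
    location = character(),
    reference = character(),
    source_url = character(),
    retrieved_at = as.POSIXct(numeric(), origin = "1970-01-01", tz = "UTC")
  )
}

proto <- traits_prototype()
```

Returning this prototype when a species has no traits keeps downstream `bind_rows()` calls type-stable.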

`sp_trends()`

Retrieve trend data.

Should support:

species filtering;
region filtering where available.

`sp_references()`

Extract references/citations.

Trait-Centric Workflows

The package should not only support species-centric workflows.

Users should also be able to retrieve:

all species containing a trait;
all observations for a trait;
trend-centric extractions.

Potential functions:

`sp_trait_catalogue()`

Return available traits.

`sp_trait_data()`

Retrieve observations for a trait across species.

`sp_trend_catalogue()`

Return available trends.

Parsing Architecture

IMPORTANT

Never combine:

HTTP retrieval;
HTML parsing;
cleaning;
formatting.

Each stage should be modular.

Recommended architecture

Retrieval

fetch_page()

Responsibilities:

request handling;
retries;
headers;
rate limiting.

Returns:

raw HTML document.
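A possible shape for the retrieval layer using httr2, offered as a sketch rather than the final implementation. The user agent string, the request rate, the retry count, and the split into `build_request()` plus `fetch_page()` are all assumptions to be tuned.

```r
library(httr2)

# Hypothetical contact string; replace with the real package URL or maintainer email.
sharkipedia_user_agent <- "sharkipediaR (contact: maintainer@example.org)"

# Build a polite request without sending it.
build_request <- function(url, requests_per_second = 0.5) {
  req <- request(url)
  req <- req_user_agent(req, sharkipedia_user_agent)
  req <- req_throttle(req, rate = requests_per_second)  # rate limiting
  req <- req_retry(req, max_tries = 3)                  # retry transient failures
  req_timeout(req, 30)
}

# Perform the request and return the parsed HTML document (xml2).
fetch_page <- function(url) {
  resp <- req_perform(build_request(url))
  resp_body_html(resp)
}

# Building a request makes no network call, so it is safe to inspect in tests.
req <- build_request("https://example.org/species/aetobatus-narinari")
```

Keeping request construction separate from `req_perform()` lets headers and throttling be tested offline.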

Parsing

parse_traits_table()
parse_taxonomy()
parse_references()

Responsibilities:

HTML selector logic only.
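A sketch of a parser that locates the traits table by its header cells rather than by position. The header labels `Trait` and `Value` are assumptions about the page and should be checked against real species pages.

```r
library(rvest)

# Return the first table whose header cells include all required labels.
parse_traits_table <- function(doc, required_headers = c("Trait", "Value")) {
  tables <- html_elements(doc, "table")
  for (i in seq_along(tables)) {
    headers <- html_text2(html_elements(tables[[i]], "th"))
    if (all(required_headers %in% headers)) {
      # convert = FALSE keeps everything as character; typing belongs to cleaning.
      return(html_table(tables[[i]], convert = FALSE))
    }
  }
  warning("No table found with headers: ",
          paste(required_headers, collapse = ", "), call. = FALSE)
  NULL
}

# Inline stand-in for a species page with an unrelated table first.
page <- minimal_html('
  <table><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
  <table>
    <thead><tr><th>Trait</th><th>Value</th><th>Sex</th></tr></thead>
    <tbody>
      <tr><td>Linf</td><td>120.5</td><td>Female</td></tr>
      <tr><td>k</td><td>0.21</td><td>Male</td></tr>
    </tbody>
  </table>
')
raw_traits <- parse_traits_table(page)
```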

Cleaning

clean_traits()

Responsibilities:

typing;
naming;
standardisation.
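The cleaning stage could be sketched as follows, assuming the parser hands over character columns named `Trait`, `Value`, and `Sex` (assumed labels). Values that cannot be parsed become explicit `NA` and are reported rather than dropped quietly.

```r
library(dplyr)

clean_traits <- function(raw) {
  out <- raw %>%
    rename_with(~ tolower(gsub("[^A-Za-z0-9]+", "_", .x))) %>%  # snake_case names
    rename(trait_name = trait)

  # Convert values to numeric and count entries that fail to parse.
  value_num <- suppressWarnings(as.numeric(out$value))
  n_failed <- sum(is.na(value_num) & !is.na(out$value))
  if (n_failed > 0) {
    warning(n_failed, " value(s) could not be parsed as numeric.", call. = FALSE)
  }

  out %>%
    mutate(
      value = value_num,
      sex = na_if(tolower(trimws(sex)), "")  # empty strings become NA
    )
}

# Illustrative input only.
raw <- tibble(
  Trait = c("Linf", "k"),
  Value = c("120.5", "n/a"),
  Sex = c("Female", "")
)
cleaned <- suppressWarnings(clean_traits(raw))
```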

Validation

validate_traits()

Responsibilities:

required columns;
missing data;
malformed tables.
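A minimal validation sketch in base R. The set of required columns is an assumption drawn from the expected `sp_traits()` columns; the real list may be longer.

```r
# Fail loudly on structural problems; warn on suspicious but usable input.
validate_traits <- function(traits,
                            required = c("species", "trait_name", "value",
                                         "source_url", "retrieved_at")) {
  if (!is.data.frame(traits)) {
    stop("`traits` must be a data frame.", call. = FALSE)
  }
  missing_cols <- setdiff(required, names(traits))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ",
         paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  if (nrow(traits) == 0) {
    warning("Traits table has zero rows.", call. = FALSE)
  }
  invisible(traits)
}

# Illustrative input only.
ok <- data.frame(
  species = "Aetobatus narinari",
  trait_name = "Linf",
  value = 120.5,
  source_url = "https://example.org/species/aetobatus-narinari",
  retrieved_at = Sys.time()
)
validated <- validate_traits(ok)
```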

Error Handling

Use:

purrr::possibly() (paired with an explicit warning, since it otherwise swallows the error)
purrr::safely()
cli warnings/messages
structured condition handling

Never silently fail.

All scraping functions should gracefully handle:

missing tables;
malformed pages;
HTTP failures;
changed HTML structure.
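A sketch of failure handling that keeps a batch running while still surfacing each problem. `parse_or_warn()` and the stub parser are hypothetical. The pattern wraps the parser in `purrr::safely()` and converts captured errors into `cli` warnings.

```r
library(purrr)

# Run `parser` on one URL; on error, warn and return NULL instead of aborting.
parse_or_warn <- function(url, parser) {
  result <- safely(parser)(url)
  if (!is.null(result$error)) {
    msg <- conditionMessage(result$error)
    cli::cli_warn("Failed to parse {.url {url}}: {msg}")
    return(NULL)
  }
  result$result
}

# Stub parser standing in for a real page parser.
stub_parser <- function(url) {
  if (grepl("broken", url)) stop("traits table not found")
  data.frame(source_url = url)
}

urls <- c("https://example.org/species/ok", "https://example.org/species/broken")
results <- map(urls, parse_or_warn, parser = stub_parser)
successful <- compact(results)  # drop the NULL left by the failed page
```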

Parallelisation Strategy

Parallelisation should:

be opt-in;
use future/furrr;
remain conservative.

Recommended defaults:

sequential by default;
optional multisession.

Never hammer the server.

Include randomised delays.

Example:

Sys.sleep(runif(1, 0.5, 1.5))
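A sketch of opt-in parallel retrieval built on the delay above. `polite_map()` and the stub fetcher are hypothetical. The plan stays sequential unless the user changes it.

```r
library(future)
library(furrr)

# Map `fetch` over URLs, sleeping a random interval before each request.
polite_map <- function(urls, fetch, delay_range = c(0.5, 1.5)) {
  future_map(
    urls,
    function(url) {
      Sys.sleep(runif(1, delay_range[1], delay_range[2]))
      fetch(url)
    },
    .options = furrr_options(seed = TRUE)  # reproducible random delays
  )
}

plan(sequential)                    # default: no parallelism
# plan(multisession, workers = 2)   # opt in explicitly, with few workers

# Stub fetcher standing in for a real retrieval function.
stub_fetch <- function(url) paste("fetched:", url)
pages <- polite_map(
  c("https://example.org/species/a", "https://example.org/species/b"),
  stub_fetch,
  delay_range = c(0, 0)  # no delay for the demonstration only
)
```

Each worker sleeps independently, so the aggregate request rate grows with the number of workers. Keep the worker count small.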

Caching Strategy

Strongly recommended.

Potential approaches:

memoise;
local RDS cache;
pins integration later.

Cache:

species index pages;
species HTML;
parsed tables.
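A minimal local RDS cache in base R, as one of the approaches listed above. `cached_fetch()`, the cache location, and the one-day expiry are assumptions. Parsed xml2 documents hold external pointers and do not survive `saveRDS()`, so this sketch caches the HTML as text.

```r
# Return a cached value when a fresh copy exists; otherwise fetch and store it.
cached_fetch <- function(url, fetch,
                         cache_dir = file.path(tempdir(), "sharkipediaR-cache"),
                         max_age_secs = 24 * 60 * 60) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  key <- gsub("[^A-Za-z0-9]+", "_", url)  # filesystem-safe cache key
  path <- file.path(cache_dir, paste0(key, ".rds"))

  if (file.exists(path)) {
    age <- as.numeric(difftime(Sys.time(), file.mtime(path), units = "secs"))
    if (age < max_age_secs) return(readRDS(path))
  }
  value <- fetch(url)
  saveRDS(value, path)
  value
}

# Stub fetcher that counts how often it is really called.
calls <- 0
stub_fetch <- function(url) {
  calls <<- calls + 1
  paste0("<html><body>page for ", url, "</body></html>")
}

demo_dir <- tempfile("cache-demo-")
first <- cached_fetch("https://example.org/species/a", stub_fetch, cache_dir = demo_dir)
second <- cached_fetch("https://example.org/species/a", stub_fetch, cache_dir = demo_dir)
```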

HTML Parsing Guidance

Prefer:

semantic selectors;
table headers;
text-based anchors.

Avoid:

deeply nested positional selectors;
JS-derived selectors;
brittle nth-child logic.
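A text-based anchor can be sketched with XPath: find a heading by its text, then take the first table that follows it. The heading labels `Traits` and `Trends` are assumptions about the page, and `table_after_heading()` is a hypothetical helper that does not escape apostrophes in `heading_text`.

```r
library(rvest)

# Select the first table that follows a heading containing `heading_text`.
table_after_heading <- function(doc, heading_text) {
  xpath <- sprintf(
    "//*[self::h1 or self::h2 or self::h3][contains(normalize-space(.), '%s')]/following-sibling::table[1]",
    heading_text
  )
  html_element(doc, xpath = xpath)
}

# Inline stand-in for a species page.
page <- minimal_html('
  <h2>Traits</h2>
  <table><tr><th>Trait</th><th>Value</th></tr><tr><td>Linf</td><td>120.5</td></tr></table>
  <h2>Trends</h2>
  <table><tr><th>Year</th><th>Index</th></tr><tr><td>2001</td><td>0.8</td></tr></table>
')

trends <- html_table(table_after_heading(page, "Trends"))
absent <- table_after_heading(page, "Genetics")  # an xml_missing node, not an error
```

This selector keeps working if tables are reordered or new sections are inserted above, which a selector like `table:nth-child(2)` would not.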

Likely Website Technology

Current evidence suggests:

Ruby on Rails backend;
server-rendered HTML;
Turbolinks/Turbo navigation;
JS primarily for navigation enhancement.

Implications:

rvest should suffice;
browser automation should be avoided unless absolutely necessary.

ECC / AI Workflow Integration

This repository should be optimised for AI-assisted development.

Recommended AI Context Files

`project_context.md`

Contains:

package philosophy;
coding conventions;
package goals;
dependency constraints.

`selectors.md`

Tracks:

known CSS selectors;
page structures;
fragile parsing areas.

`website_notes.md`

Tracks:

discovered routes;
HTML structures;
observed inconsistencies.

`development_tasks.md`

Tracks:

TODOs;
parser stability;
coverage gaps.

AI Coding Instructions

When modifying package code:

ALWAYS

preserve tidyverse semantics;
write small composable functions;
separate parsing from cleaning;
maintain explicit typing;
add tests for selectors;
prefer readability over cleverness.

NEVER

write giant monolithic scraping functions;
hardcode fragile selectors unnecessarily;
aggressively parallelise requests;
use browser automation unless justified;
silently suppress parsing failures.

Testing Strategy

Use:

testthat;
snapshot tests where appropriate;
selector validation tests.

Important test types:

HTML structure tests

Ensure selectors still work.

Parsing tests

Ensure:

column names;
typing;
expected row structures.

Regression tests

Store representative HTML fixtures.

Avoid relying entirely on live requests.
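A fixture-based selector test might look like the sketch below. The inline HTML stands in for a stored fixture file, and the header labels are assumptions.

```r
library(testthat)
library(rvest)

# Inline stand-in for a saved fixture file; a real test would read one from disk.
fixture <- minimal_html('
  <table>
    <tr><th>Trait</th><th>Value</th></tr>
    <tr><td>Linf</td><td>120.5</td></tr>
  </table>
')

test_result <- test_that("traits table keeps its expected structure", {
  tbl <- html_table(html_element(fixture, "table"), convert = FALSE)
  expect_named(tbl, c("Trait", "Value"))
  expect_type(tbl$Value, "character")
  expect_equal(nrow(tbl), 1)
})
```

When the live site changes, refreshing the fixture and rerunning this test shows which selectors broke without sending any requests.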

Documentation Strategy

Use:

roxygen2;
pkgdown;
long-form ecological examples.

Documentation should prioritise:

reproducibility;
ecological workflows;
fisheries examples;
tidyverse integration.

Example Desired Workflow

library(sharkipediaR)
library(dplyr)

traits <- sp_traits("Aetobatus narinari")

traits %>%
  filter(trait_name == "Linf") %>%
  summarise(mean_linf = mean(value, na.rm = TRUE))

Potential Future Extensions

Spatial workflows

Potential integration:

sf
terra

Offline caching

Potential:

pins
arrow/parquet

Taxonomic harmonisation

Potential:

FishBase joins;
WoRMS integration.

API support

If Sharkipedia later exposes APIs:

retain stable user-facing interfaces;
swap backend retrieval layer only.
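One way to keep the user-facing interface stable is to pass the backend as a function. Everything below is hypothetical: the two backends return canned rows, and `sp_traits_sketch()` only illustrates the swap point.

```r
# Hypothetical backends: each takes a species name and returns a data frame.
backend_html <- function(species) {
  data.frame(species = species, trait_name = "Linf", value = 120.5,
             html_only_field = "ignored")
}
backend_api <- function(species) {
  data.frame(value = 120.5, trait_name = "Linf", species = species)
}

# The user-facing function depends only on the column contract, not the backend.
sp_traits_sketch <- function(species, backend = backend_html) {
  out <- backend(species)
  contract <- c("species", "trait_name", "value")
  stopifnot(is.data.frame(out), all(contract %in% names(out)))
  out[, contract]
}

from_html <- sp_traits_sketch("Aetobatus narinari")
from_api <- sp_traits_sketch("Aetobatus narinari", backend = backend_api)
```

Both calls return the same columns in the same order, so user code would not need to change if the retrieval layer moved from HTML parsing to an API.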

Publication Goals

Initial

GitHub package;
MIT licence;
pkgdown site.

Later

CRAN readiness;
CI/CD;
stable release cycle.

Recommended Immediate Tasks

Phase 1

create package skeleton;
configure usethis;
implement species URL scraping;
implement species parser.

Phase 2

implement trait extraction;
implement cleaning pipelines;
implement tests.

Phase 3

parallel retrieval workflows;
caching;
pkgdown site.

Phase 4

trait-centric interfaces;
trend extraction;
advanced workflows.

Existing Relevant Resources

Sharkipedia website

Primary target.

Likely technologies:

Ruby on Rails;
HTML tables;
JS-enhanced navigation.

Sharkipedia GitHub repository

Use for:

understanding HTML structure;
discovering route patterns;
identifying semantic structure;
anticipating future changes.

Avoid tightly coupling parsing logic to implementation internals.

Historical workshop

Reference:

https://github.com/creeas/shaRk-workshop

Potential uses:

understanding ecological workflows;
naming conventions;
user expectations.

Final Architectural Principle

The package should feel like:

a modern tidyverse client;
a scientific data interface;
a reproducible ecology toolkit.

Not:

a brittle scraping script collection.

sharkipediaR — Development blueprint
