sharkipediaR — Development blueprint
Source: DEVELOPMENT.md
High-Level Philosophy
This package is NOT:
- a bulk mirroring tool;
- an aggressive crawler;
- a browser automation project;
- a Selenium project;
- a JS rendering framework.
This package IS:
- a lightweight R client;
- a tidy data interface;
- a scientific reproducibility tool;
- an ecology/fisheries workflow package;
- a structured HTML parsing system.
The Sharkipedia website appears primarily server-rendered using Ruby on Rails + HTML.
Current evidence suggests:
- species pages contain directly embedded HTML tables;
- extensive JS execution is not required;
- rvest parsing is sufficient for most workflows.
Avoid unnecessary complexity.
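As a minimal sketch of why rvest alone should be enough, the snippet below reads a table from static HTML. The markup is a hypothetical stand-in for a species page (the real Sharkipedia structure may differ), and the column names are assumptions.

```r
library(rvest)

# Hypothetical fragment standing in for a server-rendered species page;
# the real Sharkipedia markup may differ.
page <- minimal_html('
  <table>
    <tr><th>Trait</th><th>Value</th><th>Sex</th></tr>
    <tr><td>Linf</td><td>226.5</td><td>Female</td></tr>
    <tr><td>k</td><td>0.12</td><td>Female</td></tr>
  </table>
')

# html_table() converts the <table> node straight into a tibble,
# with no JavaScript execution involved.
traits_raw <- page %>%
  html_element("table") %>%
  html_table()
```

If the tables really are embedded in the served HTML, nothing beyond this kind of call is needed to reach the data.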
Primary User Personas
Marine ecologists
Need:
- trait datasets;
- growth parameters;
- fisheries trends;
- reproducible extraction pipelines.
Quantitative fisheries scientists
Need:
- structured biological traits;
- trend retrieval;
- batch extraction;
- reproducible metadata.
Core Design Principles
1. Tidy-first
All user-facing outputs should return:
- tibble objects;
- long-form tidy structures where possible;
- consistent column naming.
Prefer:
- snake_case;
- explicit typing;
- explicit missing values.
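A minimal sketch of these conventions, assuming a raw table with free-form headers and text values. The helper names `to_snake_case()` and `clean_traits()`, and the list of placeholder strings treated as missing, are hypothetical.

```r
library(tibble)

# Hypothetical helper: convert arbitrary header text to snake_case.
to_snake_case <- function(x) {
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)  # split camelCase
  x <- gsub("[^A-Za-z0-9]+", "_", x)            # non-alphanumerics -> "_"
  x <- gsub("^_+|_+$", "", x)                   # trim stray underscores
  tolower(x)
}

# Hypothetical helper: name columns consistently, turn empty strings and
# placeholders into NA, and type `value` explicitly as numeric.
clean_traits <- function(df) {
  names(df) <- to_snake_case(names(df))
  df[] <- lapply(df, function(col) {
    if (is.character(col)) {
      col <- trimws(col)
      col[col %in% c("", "-", "NA", "n/a")] <- NA_character_
    }
    col
  })
  df$value <- suppressWarnings(as.numeric(df$value))
  as_tibble(df)
}

raw <- data.frame(
  `Trait Name` = c("Linf", "k"),
  Value = c("226.5", "n/a"),
  valueType = c("mean", ""),
  check.names = FALSE
)
out <- clean_traits(raw)
```

Here `out` has the columns `trait_name`, `value` and `value_type`, with the unparseable value and the empty string both stored as `NA`.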
2. Functional architecture
Separate:
- retrieval;
- parsing;
- cleaning;
- validation;
- formatting.
Avoid monolithic functions.
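One possible shape for this separation is sketched below. Every function name is hypothetical, and the retrieval stage is passed in as an argument so that the rest of the pipeline can run offline against a fake page.

```r
# Each stage is a small function with one responsibility. All names here
# are hypothetical; `fetch` is injected so the pipeline can run offline.
parse_stage <- function(html) {
  rvest::html_table(rvest::html_element(html, "table"))
}
clean_stage <- function(df) {
  names(df) <- tolower(names(df))
  df
}
validate_stage <- function(df, required = c("trait", "value")) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing expected columns: ", paste(missing_cols, collapse = ", "))
  }
  df
}
format_stage <- function(df, source_url) {
  df$source_url <- source_url
  df$retrieved_at <- Sys.time()
  tibble::as_tibble(df)
}

run_pipeline <- function(url, fetch) {
  html <- fetch(url)                  # retrieval
  df <- parse_stage(html)             # parsing
  df <- clean_stage(df)               # cleaning
  df <- validate_stage(df)            # validation
  format_stage(df, source_url = url)  # formatting
}

# Offline stand-in for the retrieval stage.
fake_fetch <- function(url) {
  rvest::minimal_html("<table><tr><th>Trait</th><th>Value</th></tr>
                       <tr><td>Linf</td><td>226.5</td></tr></table>")
}
result <- run_pipeline("https://example.org/species/1", fetch = fake_fetch)
```

Because no stage knows about the others, each can be unit-tested and replaced independently.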
3. Robustness over cleverness
Website structure may evolve.
Prefer:
- semantic selectors;
- defensive parsing;
- validation checks;
- explicit warnings.
Avoid brittle positional selectors.
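A validation check with an explicit warning might look like the sketch below. `check_columns()` is a hypothetical helper, and the expected column names are assumptions.

```r
# Hypothetical validator: warn, rather than fail silently, when a parsed
# table no longer has the columns the package expects.
check_columns <- function(df, expected) {
  missing_cols <- setdiff(expected, names(df))
  if (length(missing_cols) > 0) {
    cli::cli_warn(c(
      "Parsed table is missing expected columns.",
      "x" = "Missing: {.field {missing_cols}}",
      "i" = "The page structure may have changed."
    ))
  }
  invisible(length(missing_cols) == 0)
}

# Returns TRUE (invisibly) when every expected column is present.
ok <- check_columns(data.frame(trait = "Linf", value = 1), c("trait", "value"))
```

When a column is missing, the call still returns (with `FALSE`), so the caller can decide whether to continue or abort.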
Suggested Package Structure
sharkipediaR/
├── R/
├── man/
├── tests/
├── vignettes/
├── data-raw/
├── inst/
├── DESCRIPTION
├── NAMESPACE
├── README.md
├── LICENSE
└── _pkgdown.yml
Initial User-Facing API
Species-level extraction
sp_traits()
Primary trait extraction function.
Inputs:
- species URL;
- scientific name;
- vector of species.
Outputs:
Long-form tibble.
Expected columns:
- species;
- trait_name;
- value;
- standard;
- value_type;
- sex;
- location;
- reference;
- source_url;
- retrieved_at.
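The expected columns can be pinned down as a zero-row prototype. This is a sketch only: the column names come from the list above, but the column types and the helper name `empty_traits()` are assumptions.

```r
library(tibble)

# Zero-row prototype of the long-form output. Column names follow the list
# above; the column types are assumptions, not a confirmed schema.
empty_traits <- function() {
  tibble(
    species      = character(),
    trait_name   = character(),
    value        = double(),
    standard     = character(),
    value_type   = character(),
    sex          = character(),
    location     = character(),
    reference    = character(),
    source_url   = character(),
    retrieved_at = as.POSIXct(double(), origin = "1970-01-01", tz = "UTC")
  )
}

schema <- empty_traits()
```

Returning such a prototype when a page has no trait table keeps column types stable when results from many species are row-bound together.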
Trait-Centric Workflows
The package should support more than species-centric workflows.
Users should also be able to retrieve:
- all species containing a trait;
- all observations for a trait;
- trend-centric extractions.
Potential functions:
Parsing Architecture
IMPORTANT
Never combine:
- HTTP retrieval;
- HTML parsing;
- cleaning;
- formatting.
Each stage should be modular.
Recommended architecture
Retrieval
Responsibilities:
- request handling;
- retries;
- headers;
- rate limiting.
Returns:
- raw HTML document.
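A retrieval stage covering these responsibilities could be sketched with httr2. The choice of httr2 is an assumption (the blueprint does not name an HTTP client), and the function names, user-agent string, retry count and throttle rate are all illustrative.

```r
library(httr2)

# Hypothetical request builder. The user-agent string, retry count and
# throttle rate are illustrative defaults, not values from the blueprint.
build_request <- function(url) {
  req <- request(url)
  req <- req_user_agent(req, "sharkipediaR (development sketch)")
  req <- req_retry(req, max_tries = 3)      # by default retries HTTP 429 and 503
  req <- req_throttle(req, rate = 30 / 60)  # roughly one request every two seconds
  req
}

# Hypothetical retrieval stage: performs the request and returns the parsed
# HTML document, leaving all selector logic to the parsing stage.
fetch_html <- function(url) {
  resp <- req_perform(build_request(url))
  resp_body_html(resp)
}

# Building a request does not touch the network; only fetch_html() does.
req <- build_request("https://example.org/species/1")
```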
Parsing
parse_traits_table()
parse_taxonomy()
parse_references()
Responsibilities:
- HTML selector logic only.
Error Handling
Use:
- purrr::possibly()
- purrr::safely()
- cli warnings/messages
- structured condition handling
Never silently fail.
All scraping functions should gracefully handle:
- missing tables;
- malformed pages;
- HTTP failures;
- changed HTML structure.
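The sketch below shows one way to combine `purrr::safely()` with a cli warning so that a page without a table yields an empty tibble instead of aborting a batch run. The function names and the warning text are hypothetical.

```r
# Hypothetical parser that errors when the page has no table.
parse_table_strict <- function(html) {
  node <- rvest::html_element(html, "table")
  if (inherits(node, "xml_missing")) {
    stop("no <table> found on page")
  }
  rvest::html_table(node)
}

# safely() captures the error instead of aborting a batch run; the wrapper
# then reports it through cli and returns an empty tibble.
parse_table_safe <- function(html, url = NA_character_) {
  out <- purrr::safely(parse_table_strict)(html)
  if (!is.null(out$error)) {
    cli::cli_warn(c(
      "Could not parse traits table.",
      "x" = "{conditionMessage(out$error)}",
      "i" = "Source: {.url {url}}"
    ))
    return(tibble::tibble())
  }
  out$result
}

good_page <- rvest::minimal_html(
  "<table><tr><th>Trait</th></tr><tr><td>Linf</td></tr></table>"
)
parsed <- parse_table_safe(good_page)
```

`purrr::possibly()` is the shorter option when only a fallback value is needed. `safely()` is used here because it keeps the error object, so the warning can say what went wrong.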
Parallelisation Strategy
Parallelisation should:
- be opt-in;
- use future/furrr;
- remain conservative.
Recommended defaults:
- sequential by default;
- optional multisession.
Never hammer the server.
Include randomised delays.
Example:
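A minimal sketch, assuming a hypothetical helper `polite_map()` that pauses for a random interval before each request and only goes parallel when asked. The delay bounds (in seconds) and the worker count are illustrative.

```r
# Hypothetical helper: apply `f` to each URL with a random pause before each
# call. Sequential unless the caller opts in; the delay bounds (seconds) and
# worker count are illustrative only.
polite_map <- function(urls, f, min_delay = 1, max_delay = 3, parallel = FALSE) {
  delayed <- function(url) {
    Sys.sleep(stats::runif(1, min = min_delay, max = max_delay))
    f(url)
  }
  if (!parallel) {
    return(lapply(urls, delayed))
  }
  # Opt-in branch: a small, fixed number of workers keeps the load low.
  old_plan <- future::plan(future::multisession, workers = 2)
  on.exit(future::plan(old_plan), add = TRUE)
  furrr::future_map(urls, delayed, .options = furrr::furrr_options(seed = TRUE))
}

# Sequential use; delays are set to zero here only to keep the demo instant.
out <- polite_map(c("a", "b"), toupper, min_delay = 0, max_delay = 0)
```

Even in the parallel branch each worker still sleeps before every request, so the total request rate stays bounded by the number of workers.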
Caching Strategy
Strongly recommended.
Potential approaches:
- memoise;
- local RDS cache;
- pins integration later.
Cache:
- species index pages;
- species HTML;
- parsed tables.
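The local RDS option can be sketched in base R as below. The helper names, the key-to-filename scheme and the default cache directory are all hypothetical.

```r
# Hypothetical local RDS cache: compute once per key, then read from disk.
cache_path <- function(key, cache_dir) {
  file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]+", "_", key), ".rds"))
}

with_cache <- function(key, compute,
                       cache_dir = file.path(tempdir(), "sharkipediaR")) {
  path <- cache_path(key, cache_dir)
  if (file.exists(path)) {
    return(readRDS(path))  # cache hit: no network access needed
  }
  value <- compute()       # cache miss: run the expensive step
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  saveRDS(value, path)
  value
}

# The second call with the same key is served from disk.
demo_dir <- tempfile("cache")
first  <- with_cache("https://example.org/species/1", function() "<html></html>", demo_dir)
second <- with_cache("https://example.org/species/1", function() stop("not called"), demo_dir)
```

Species HTML is best cached as text (for example via `as.character()` on the document), because parsed xml2 documents hold external pointers that do not survive `saveRDS()`.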
HTML Parsing Guidance
Prefer:
- semantic selectors;
- table headers;
- text-based anchors.
Avoid:
- deeply nested positional selectors;
- JS-derived selectors;
- brittle nth-child logic.
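As an illustration of anchoring on table headers rather than position, the sketch below picks a table by the labels in its header cells. The helper name and the header labels are assumptions; the real pages may use different labels.

```r
library(rvest)

# Hypothetical helper: return the first table whose header cells include all
# of the expected labels, regardless of where the table sits on the page.
find_table_by_headers <- function(html, expected) {
  tables <- html_elements(html, "table")
  for (i in seq_along(tables)) {
    headers <- html_text2(html_elements(tables[[i]], "th"))
    if (all(expected %in% headers)) {
      return(html_table(tables[[i]]))
    }
  }
  NULL  # no matching table: the caller decides how to warn
}

# Two tables on one hypothetical page; the traits table is second.
page <- minimal_html('
  <table><tr><th>Rank</th><th>Name</th></tr>
         <tr><td>Order</td><td>Myliobatiformes</td></tr></table>
  <table><tr><th>Trait</th><th>Value</th></tr>
         <tr><td>Linf</td><td>226.5</td></tr></table>
')
traits_tbl <- find_table_by_headers(page, expected = c("Trait", "Value"))
```

If the site later adds or reorders tables, this lookup keeps working as long as the header labels stay the same, which an `nth-child` selector would not.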
Likely Website Technology
Current evidence suggests:
- Ruby on Rails backend;
- server-rendered HTML;
- Turbolinks/Turbo navigation;
- JS primarily for navigation enhancement.
Implications:
- rvest should suffice;
- browser automation should be avoided unless absolutely necessary.
Recommended AI Context Files
AI Coding Instructions
When modifying package code:
Testing Strategy
Use:
- testthat;
- snapshot tests where appropriate;
- selector validation tests.
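A selector validation test might look like the sketch below. The fixture is an inline, hypothetical fragment; in a real suite it would more likely be a saved copy of a species page, and the expected header labels are assumptions.

```r
library(testthat)

# Hypothetical fixture: in a real suite this would be a saved copy of a
# species page rather than an inline string.
fixture <- rvest::minimal_html('
  <table><tr><th>Trait</th><th>Value</th></tr>
  <tr><td>Linf</td><td>226.5</td></tr></table>
')

test_that("traits table is reachable and has the expected headers", {
  tbl <- rvest::html_element(fixture, "table")
  expect_false(inherits(tbl, "xml_missing"))
  headers <- rvest::html_text2(rvest::html_elements(tbl, "th"))
  expect_setequal(headers, c("Trait", "Value"))
})
```

Tests of this kind fail loudly when a refreshed fixture no longer matches the selectors, which is the early warning that the site structure has changed.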
Important test types:
Documentation Strategy
Use:
- roxygen2;
- pkgdown;
- long-form ecological examples.
Documentation should prioritise:
- reproducibility;
- ecological workflows;
- fisheries examples;
- tidyverse integration.
Example Desired Workflow
library(sharkipediaR)
library(dplyr)
traits <- sp_traits("Aetobatus narinari")
traits %>%
  filter(trait_name == "Linf") %>%
  summarise(mean_linf = mean(value, na.rm = TRUE))
Recommended Immediate Tasks
Existing Relevant Resources
Sharkipedia website
Primary target.
Likely technologies:
- Ruby on Rails;
- HTML tables;
- JS-enhanced navigation.
Sharkipedia GitHub repository
Use for:
- understanding HTML structure;
- discovering route patterns;
- identifying semantic structure;
- anticipating future changes.
Avoid tightly coupling parsing logic to implementation internals.
Historical workshop
Reference:
https://github.com/creeas/shaRk-workshop
Potential uses:
- understanding ecological workflows;
- naming conventions;
- user expectations.