Package architecture and function reference

library(sharkipediaR)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

vignette_fixture <- function(name) {
  path <- system.file("fixtures", name, package = "sharkipediaR")
  if (!nzchar(path)) {
    cli::cli_abort("Fixture not found: {name}")
  }
  path
}

doc <- xml2::read_html(vignette_fixture("carcharhinus_acronotus.html"))
index_doc <- xml2::read_html(vignette_fixture("species_index_page1.html"))
source_url <- "https://www.sharkipedia.org/species/carcharhinus-acronotus"
retrieved_at <- as.POSIXct("2026-05-25 12:00:00", tz = "UTC")

Design philosophy

sharkipediaR follows a strict pipeline:

HTTP retrieval  →  HTML parsing  →  cleaning  →  validation  →  user-facing tibble
(fetch.R)         (parse.R)        (clean.R)     (validate.R)   (sp_*.R)

No step is allowed to “do everything.” That separation keeps the package maintainable when Sharkipedia’s HTML evolves.

Data flow for sp_traits()

`R/constants.R`

Object / function	Role
`.sharkipedia_env`	Package-private environment storing last request time for rate limiting
`sharkipedia_base_url()`	Returns `https://www.sharkipedia.org`
`sharkipedia_user_agent()`	Identifies the client in HTTP headers (version + project URL)

These are internal but underpin polite scraping.

`R/utils.R`

`rate_limit_pause(min_gap = 0.5)`

Waits until at least min_gap seconds have elapsed since the last request, plus a small random jitter (runif(0, 0.3)). Called automatically by fetch_page().

`species_name_to_slug(name)`

Converts "Carcharhinus acronotus" → "carcharhinus-acronotus" (lowercase, spaces to hyphens).

`species_name_to_url(name)`

Builds the full species page URL from a scientific name.

sharkipediaR:::species_name_to_url("Carcharhinus acronotus")
#> [1] "https://www.sharkipedia.org/species/carcharhinus-acronotus"

`normalize_sharkipedia_url(url)`

Accepts:

full URLs,
relative paths (/species/...),
bare scientific names.

`resolve_species_url(x)`

Length-1 resolver used by all sp_*() functions. Errors if length(x) != 1.

`ensure_species_vector(x)`

Trims, de-duplicates, and validates a character vector for batch functions.

`read_html_fixture(path)`

Reads local HTML for tests (internal).

`get_fetch_page(cache = TRUE)`

Returns either fetch_page or memoised fetch_page_memoised depending on the cache argument passed to user functions.

`R/fetch.R`

`fetch_page(url, quiet = TRUE)` — exported

Responsibilities only: HTTP, rate limit, retries, return xml_document.

Step	Implementation
URL normalisation	`normalize_sharkipedia_url()`
Politeness	`rate_limit_pause()`
Request	`httr2::request()` + user agent + `req_retry(max_tries = 3)`
Error handling	Abort on status ≥ 400 with `cli` message
Body	`httr2::resp_body_html()` → xml2 document

doc_live <- fetch_page("Carcharhinus acronotus", quiet = FALSE)
class(doc_live)

`fetch_page_memoised`

memoise::memoise(fetch_page) — identical URLs in one R session return the cached document without a new HTTP call.

`R/parse.R`

Parsing functions never perform HTTP. They only read xml_document objects.

`parse_species_index(doc)`

Used by: sp_species_urls()

Selector: a[href^='/species/']
Excludes CSV export links (*.csv slugs)
Returns species, slug, url

idx <- sharkipediaR:::parse_species_index(index_doc)
head(idx, 4)
#> # A tibble: 4 × 3
#>   species             slug                              url                     
#>   <chr>               <chr>                             <chr>                   
#> 1 Rajella leopardus   rajella-leopardus                 https://www.sharkipedia…
#> 2 Gymnura lessae      gymnura-lessae                    https://www.sharkipedia…
#> 3 Styracura schmardae html-i-styracura-schmardae-i-html https://www.sharkipedia…
#> 4 Dipturus lamillai   dipturus-lamillai                 https://www.sharkipedia…

`parse_index_last_page(doc)`

Reads Bulma pagination links (.pagination-link) and returns the maximum page number — used when sp_species_urls(all_pages = TRUE).

`parse_taxonomy(doc)`

Used by: sp_species()

Species name: h1.title
Ranks: div.columns > div.column:first-child > p with labels Superorder:, Subclass:, Order:, Family:

sharkipediaR:::parse_taxonomy(doc)
#> # A tibble: 1 × 5
#>   species                superorder   subclass       order             family   
#>   <chr>                  <chr>        <chr>          <chr>             <chr>    
#> 1 Carcharhinus acronotus Galeomorphii Elasmobranchii Carcharhiniformes Carcharh…

Trait table helpers

Function	Role
`trait_table_headers()`	Reads `<thead><th>` text
`is_trait_table()`	`TRUE` if headers include `"Name"`
`preceding_trait_group()`	XPath `preceding::h4[contains(@class,'subtitle')][1]` for trait class

`parse_traits_tables(doc)`

Used by: sp_traits() via fetch_species_traits()

Find all table.table elements.
Keep tables whose headers match Sharkipedia’s trait schema.
rvest::html_table() for body rows.
Attach trait_group from the nearest preceding h4.

raw_traits <- sharkipediaR:::parse_traits_tables(doc)
names(raw_traits)
#> [1] "Name"        "Value"       "Standard"    "ValueType"   "Sex"        
#> [6] "Location"    "Reference"   "trait_group"
head(raw_traits, 3)
#> # A tibble: 3 × 8
#>   Name             Value Standard ValueType Sex   Location Reference trait_group
#>   <chr>            <chr> <chr>    <chr>     <chr> <chr>    <chr>     <chr>      
#> 1 Direct Predation Medi… Effect … mean      Pool… Caribbe… clementi… Ecological…
#> 2 Direct Predation Medi… Strengt… mean      Pool… Caribbe… clementi… Ecological…
#> 3 Amat50           4.3   Year     mean      Male  South C… driggers… Age

Note: column names are still PascalCase (Name, Value, …) at this stage.

`decode_react_props(props)`

Decodes HTML entities in data-react-props JSON and parses with jsonlite. Internal helper for trends.

`parse_trends_tables(doc)`

Used by: sp_trends()

Locate the first table after the h3 heading containing "Trends".
For each <tr>:
- location, unit, reference from table cells
- trend_id / trend_url from the “Details” link
- observations matrix from div[data-react-props]
Expand to one row per year × value.

raw_trends <- sharkipediaR:::parse_trends_tables(doc)
raw_trends %>%
  count(location, trend_id) %>%
  head(5)
#> # A tibble: 5 × 3
#>   location                                      trend_id     n
#>   <chr>                                         <chr>    <int>
#> 1 Brownsville, TX to the Florida Keys, FL (USA) 3540        23
#> 2 North Carolina to Brownsville (USA)           3537        26
#> 3 North Gulf of Mexico                          3543        60
#> 4 North Gulf of Mexico                          3544        60
#> 5 North Gulf of Mexico (USA)                    3538        22

`parse_references(doc)`

Collects unique a[href^='/references/'] from trait and trend tables.

sharkipediaR:::parse_references(doc) %>% head(5)
#> # A tibble: 5 × 2
#>   reference_id      reference_url                                           
#>   <chr>             <chr>                                                   
#> 1 clementi2021      https://www.sharkipedia.org/references/clementi2021     
#> 2 driggers2004a     https://www.sharkipedia.org/references/driggers2004a    
#> 3 trinidadcruz1997  https://www.sharkipedia.org/references/trinidadcruz1997 
#> 4 uribemartinez1993 https://www.sharkipedia.org/references/uribemartinez1993
#> 5 peterson2017      https://www.sharkipedia.org/references/peterson2017

`link_text_or_href(node)`

Utility for anchor text vs. href basename (internal).

`R/clean.R`

Cleaning standardises names, types, and provenance.

`standardize_trait_columns(df)`

Maps Sharkipedia headers to snake_case:

HTML	R column
Name	trait_name
Value	value
Standard	standard
ValueType	value_type
Sex	sex
Location	location
Reference	reference

`clean_traits(df, species, source_url, retrieved_at)`

Renames columns via standardize_trait_columns()
Adds species, source_url, retrieved_at (UTC POSIXct)
str_squish() + na_if("", .) on character fields
Strips /references/ prefixes from reference

traits_clean <- sharkipediaR:::clean_traits(
  raw_traits,
  species = "Carcharhinus acronotus",
  source_url = source_url,
  retrieved_at = retrieved_at
)
names(traits_clean)
#>  [1] "trait_name"   "value"        "standard"     "value_type"   "sex"         
#>  [6] "location"     "reference"    "trait_group"  "species"      "source_url"  
#> [11] "retrieved_at"

`clean_trends(df, species, source_url, retrieved_at)`

Coerces year to integer, keeps value as double
Adds species and provenance columns
Squashes whitespace on metadata fields

trends_clean <- sharkipediaR:::clean_trends(
  raw_trends,
  species = "Carcharhinus acronotus",
  source_url = source_url,
  retrieved_at = retrieved_at
)
summary(trends_clean$year)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1950    1970    1989    1986    2001    2018

`R/validate.R`

Validation is the last gate before data reach the user.

`validate_traits(df)`

Required columns: species, trait_name, value, source_url, retrieved_at
Warns (does not error) on zero rows — page structure may have changed

`validate_trends(df)`

Required columns: species, year, value, source_url, retrieved_at
Same empty-data warning pattern

sharkipediaR:::validate_traits(traits_clean) %>% nrow()
#> [1] 39
sharkipediaR:::validate_trends(trends_clean) %>% nrow()
#> [1] 321

`R/sp_species_urls.R`

`sp_species_urls(all_pages = FALSE, max_pages = NULL, cache = TRUE)`

Argument	Behaviour
`all_pages = FALSE`	Only page 1 of `?all=true` index (~20 species)
`all_pages = TRUE`	Reads pagination from page 1, loops all pages
`max_pages`	Caps pagination when exploring
`cache`	Uses memoised `fetch_page`

Returns deduplicated tibble: species, slug, url.

sp_species_urls()
sp_species_urls(all_pages = TRUE, max_pages = 3)

`R/sp_search.R`

`sp_search(query, index = NULL, all_pages = FALSE, cache = TRUE)`

Normalises query via ensure_species_vector().
Builds or accepts index tibble.
grepl(..., fixed = TRUE) on lowercased species and slug.
distinct(url).

Best practice: idx <- sp_species_urls(all_pages = TRUE) once, then sp_search("rhincodon", index = idx).

`R/sp_species.R`

`sp_species(species, cache = TRUE)`

End-user taxonomy extractor. Returns one row:

species, superorder, subclass, order, family, source_url, retrieved_at

ex <- example_carcharhinus()
ex$species_meta
#> # A tibble: 1 × 7
#>   species        superorder subclass order family source_url retrieved_at       
#>   <chr>          <chr>      <chr>    <chr> <chr>  <chr>      <dttm>             
#> 1 Carcharhinus … Galeomorp… Elasmob… Carc… Carch… https://w… 2026-05-25 12:00:00

`R/sp_traits.R`

`sp_traits(species, cache = TRUE)`

Public API for traits.

One species: calls fetch_species_traits().
Many species: purrr::map_dfr() with column species_input.

`fetch_species_traits(species, cache = TRUE)` — internal

Orchestrates the full trait pipeline documented above.

# Equivalent to fetch_species_traits() without HTTP:
traits_clean <- sharkipediaR:::validate_traits(traits_clean)
nrow(traits_clean)
#> [1] 39

`R/sp_trends.R`

`sp_trends(species, cache = TRUE)`

Public API for population trends (long format).

`fetch_species_trends(species, cache = TRUE)` — internal

Same pattern as traits: fetch → parse → clean → validate.

nrow(trends_clean)
#> [1] 321

`R/sp_references.R`

`sp_references(species, cache = TRUE)`

Fetches species page, runs parse_references(), adds species, source_url, retrieved_at.

Use to build a reference lookup table before joining to traits/trends.

`R/example_data.R`

`example_carcharhinus()` — exported

Loads inst/extdata/carcharhinus_acronotus.rds — a list with pre-parsed components used in all vignettes.

Rebuild after fixture updates:

source("data-raw/build-vignette-data.R")

End-to-end manual pipeline (advanced users)

Reproduce sp_traits() without calling it:

url <- sharkipediaR:::species_name_to_url("Carcharhinus acronotus")
# doc <- fetch_page(url)   # live
# raw <- parse_traits_tables(doc)
# out <- validate_traits(clean_traits(raw, "Carcharhinus acronotus", url))
# Identical structure to example_carcharhinus()$traits

This is the extension point if Sharkipedia adds a JSON API later: swap fetch_page() + parse_*() implementations while keeping sp_traits() stable.

Testing and pkgdown

Unit tests use HTML fixtures in tests/testthat/fixtures/ (no network).
Vignettes use example_carcharhinus() for plots and tables.
pkgdown picks up vignettes automatically after pkgdown::build_site().

See Getting started and Ecological workflows for ggplot demonstrations and scientific motivation.

Pablo Fuenzalida

2026-05-26

Design philosophy

R/constants.R

R/utils.R

rate_limit_pause(min_gap = 0.5)

species_name_to_slug(name)

species_name_to_url(name)

normalize_sharkipedia_url(url)

resolve_species_url(x)

ensure_species_vector(x)

read_html_fixture(path)

get_fetch_page(cache = TRUE)

R/fetch.R

fetch_page(url, quiet = TRUE) — exported

fetch_page_memoised

R/parse.R

parse_species_index(doc)

parse_index_last_page(doc)

parse_taxonomy(doc)

Trait table helpers

parse_traits_tables(doc)

decode_react_props(props)

parse_trends_tables(doc)

parse_references(doc)

link_text_or_href(node)

R/clean.R

standardize_trait_columns(df)

clean_traits(df, species, source_url, retrieved_at)

clean_trends(df, species, source_url, retrieved_at)

R/validate.R

validate_traits(df)

validate_trends(df)

R/sp_species_urls.R

sp_species_urls(all_pages = FALSE, max_pages = NULL, cache = TRUE)

R/sp_search.R

sp_search(query, index = NULL, all_pages = FALSE, cache = TRUE)

R/sp_species.R

sp_species(species, cache = TRUE)

R/sp_traits.R

sp_traits(species, cache = TRUE)

fetch_species_traits(species, cache = TRUE) — internal

R/sp_trends.R

sp_trends(species, cache = TRUE)

fetch_species_trends(species, cache = TRUE) — internal

R/sp_references.R

sp_references(species, cache = TRUE)

R/example_data.R

example_carcharhinus() — exported