Package architecture and function reference
Pablo Fuenzalida
2026-05-26
Source:vignettes/architecture-and-functions.Rmd
architecture-and-functions.Rmd
library(sharkipediaR)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
vignette_fixture <- function(name) {
path <- system.file("fixtures", name, package = "sharkipediaR")
if (!nzchar(path)) {
cli::cli_abort("Fixture not found: {name}")
}
path
}
doc <- xml2::read_html(vignette_fixture("carcharhinus_acronotus.html"))
index_doc <- xml2::read_html(vignette_fixture("species_index_page1.html"))
source_url <- "https://www.sharkipedia.org/species/carcharhinus-acronotus"
retrieved_at <- as.POSIXct("2026-05-25 12:00:00", tz = "UTC")Design philosophy
sharkipediaR follows a strict pipeline:
HTTP retrieval → HTML parsing → cleaning → validation → user-facing tibble
(fetch.R) (parse.R) (clean.R) (validate.R) (sp_*.R)
No step is allowed to “do everything.” That separation keeps the package maintainable when Sharkipedia’s HTML evolves.

Data flow for sp_traits()
R/constants.R
| Object / function | Role |
|---|---|
.sharkipedia_env |
Package-private environment storing last request time for rate limiting |
sharkipedia_base_url() |
Returns https://www.sharkipedia.org
|
sharkipedia_user_agent() |
Identifies the client in HTTP headers (version + project URL) |
These are internal but underpin polite scraping.
R/utils.R
rate_limit_pause(min_gap = 0.5)
Waits until at least min_gap seconds have elapsed since
the last request, plus a small random jitter
(runif(0, 0.3)). Called automatically by
fetch_page().
species_name_to_slug(name)
Converts "Carcharhinus acronotus" →
"carcharhinus-acronotus" (lowercase, spaces to
hyphens).
species_name_to_url(name)
Builds the full species page URL from a scientific name.
sharkipediaR:::species_name_to_url("Carcharhinus acronotus")
#> [1] "https://www.sharkipedia.org/species/carcharhinus-acronotus"
normalize_sharkipedia_url(url)
Accepts:
- full URLs,
- relative paths (
/species/...), - bare scientific names.
R/fetch.R
fetch_page(url, quiet = TRUE) —
exported
Responsibilities only: HTTP, rate limit, retries,
return xml_document.
| Step | Implementation |
|---|---|
| URL normalisation | normalize_sharkipedia_url() |
| Politeness | rate_limit_pause() |
| Request |
httr2::request() + user agent +
req_retry(max_tries = 3)
|
| Error handling | Abort on status ≥ 400 with cli message |
| Body |
httr2::resp_body_html() → xml2
document |
doc_live <- fetch_page("Carcharhinus acronotus", quiet = FALSE)
class(doc_live)
R/parse.R
Parsing functions never perform HTTP. They only read
xml_document objects.
parse_species_index(doc)
Used by: sp_species_urls()
- Selector:
a[href^='/species/'] - Excludes CSV export links (
*.csvslugs) - Returns
species,slug,url
idx <- sharkipediaR:::parse_species_index(index_doc)
head(idx, 4)
#> # A tibble: 4 × 3
#> species slug url
#> <chr> <chr> <chr>
#> 1 Rajella leopardus rajella-leopardus https://www.sharkipedia…
#> 2 Gymnura lessae gymnura-lessae https://www.sharkipedia…
#> 3 Styracura schmardae html-i-styracura-schmardae-i-html https://www.sharkipedia…
#> 4 Dipturus lamillai dipturus-lamillai https://www.sharkipedia…
parse_index_last_page(doc)
Reads Bulma pagination links (.pagination-link) and
returns the maximum page number — used when
sp_species_urls(all_pages = TRUE).
parse_taxonomy(doc)
Used by: sp_species()
- Species name:
h1.title - Ranks:
div.columns > div.column:first-child > pwith labelsSuperorder:,Subclass:,Order:,Family:
sharkipediaR:::parse_taxonomy(doc)
#> # A tibble: 1 × 5
#> species superorder subclass order family
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Carcharhinus acronotus Galeomorphii Elasmobranchii Carcharhiniformes Carcharh…Trait table helpers
| Function | Role |
|---|---|
trait_table_headers() |
Reads <thead><th> text |
is_trait_table() |
TRUE if headers include "Name"
|
preceding_trait_group() |
XPath preceding::h4[contains(@class,'subtitle')][1] for
trait class |
parse_traits_tables(doc)
Used by: sp_traits() via
fetch_species_traits()
- Find all
table.tableelements. - Keep tables whose headers match Sharkipedia’s trait schema.
-
rvest::html_table()for body rows. - Attach
trait_groupfrom the nearest precedingh4.
raw_traits <- sharkipediaR:::parse_traits_tables(doc)
names(raw_traits)
#> [1] "Name" "Value" "Standard" "ValueType" "Sex"
#> [6] "Location" "Reference" "trait_group"
head(raw_traits, 3)
#> # A tibble: 3 × 8
#> Name Value Standard ValueType Sex Location Reference trait_group
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Direct Predation Medi… Effect … mean Pool… Caribbe… clementi… Ecological…
#> 2 Direct Predation Medi… Strengt… mean Pool… Caribbe… clementi… Ecological…
#> 3 Amat50 4.3 Year mean Male South C… driggers… AgeNote: column names are still PascalCase
(Name, Value, …) at this stage.
decode_react_props(props)
Decodes HTML entities in data-react-props JSON and
parses with jsonlite. Internal helper for trends.
parse_trends_tables(doc)
Used by: sp_trends()
- Locate the first table after the
h3heading containing"Trends". - For each
<tr>:-
location,unit,referencefrom table cells -
trend_id/trend_urlfrom the “Details” link -
observationsmatrix fromdiv[data-react-props]
-
- Expand to one row per year × value.
raw_trends <- sharkipediaR:::parse_trends_tables(doc)
raw_trends %>%
count(location, trend_id) %>%
head(5)
#> # A tibble: 5 × 3
#> location trend_id n
#> <chr> <chr> <int>
#> 1 Brownsville, TX to the Florida Keys, FL (USA) 3540 23
#> 2 North Carolina to Brownsville (USA) 3537 26
#> 3 North Gulf of Mexico 3543 60
#> 4 North Gulf of Mexico 3544 60
#> 5 North Gulf of Mexico (USA) 3538 22
parse_references(doc)
Collects unique a[href^='/references/'] from trait and
trend tables.
sharkipediaR:::parse_references(doc) %>% head(5)
#> # A tibble: 5 × 2
#> reference_id reference_url
#> <chr> <chr>
#> 1 clementi2021 https://www.sharkipedia.org/references/clementi2021
#> 2 driggers2004a https://www.sharkipedia.org/references/driggers2004a
#> 3 trinidadcruz1997 https://www.sharkipedia.org/references/trinidadcruz1997
#> 4 uribemartinez1993 https://www.sharkipedia.org/references/uribemartinez1993
#> 5 peterson2017 https://www.sharkipedia.org/references/peterson2017
R/clean.R
Cleaning standardises names, types, and provenance.
standardize_trait_columns(df)
Maps Sharkipedia headers to snake_case:
| HTML | R column |
|---|---|
| Name | trait_name |
| Value | value |
| Standard | standard |
| ValueType | value_type |
| Sex | sex |
| Location | location |
| Reference | reference |
clean_traits(df, species, source_url, retrieved_at)
- Renames columns via
standardize_trait_columns() - Adds
species,source_url,retrieved_at(UTC POSIXct) -
str_squish()+na_if("", .)on character fields - Strips
/references/prefixes fromreference
traits_clean <- sharkipediaR:::clean_traits(
raw_traits,
species = "Carcharhinus acronotus",
source_url = source_url,
retrieved_at = retrieved_at
)
names(traits_clean)
#> [1] "trait_name" "value" "standard" "value_type" "sex"
#> [6] "location" "reference" "trait_group" "species" "source_url"
#> [11] "retrieved_at"
clean_trends(df, species, source_url, retrieved_at)
- Coerces
yearto integer, keepsvalueas double - Adds species and provenance columns
- Squashes whitespace on metadata fields
trends_clean <- sharkipediaR:::clean_trends(
raw_trends,
species = "Carcharhinus acronotus",
source_url = source_url,
retrieved_at = retrieved_at
)
summary(trends_clean$year)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1950 1970 1989 1986 2001 2018
R/validate.R
Validation is the last gate before data reach the user.
R/sp_species_urls.R
sp_species_urls(all_pages = FALSE, max_pages = NULL, cache = TRUE)
| Argument | Behaviour |
|---|---|
all_pages = FALSE |
Only page 1 of ?all=true index (~20 species) |
all_pages = TRUE |
Reads pagination from page 1, loops all pages |
max_pages |
Caps pagination when exploring |
cache |
Uses memoised fetch_page
|
Returns deduplicated tibble: species, slug,
url.
sp_species_urls()
sp_species_urls(all_pages = TRUE, max_pages = 3)
R/sp_search.R
sp_search(query, index = NULL, all_pages = FALSE, cache = TRUE)
- Normalises
queryviaensure_species_vector(). - Builds or accepts
indextibble. -
grepl(..., fixed = TRUE)on lowercasedspeciesandslug. -
distinct(url).
Best practice:
idx <- sp_species_urls(all_pages = TRUE) once, then
sp_search("rhincodon", index = idx).
R/sp_species.R
sp_species(species, cache = TRUE)
End-user taxonomy extractor. Returns one row:
species, superorder, subclass,
order, family, source_url,
retrieved_at
ex <- example_carcharhinus()
ex$species_meta
#> # A tibble: 1 × 7
#> species superorder subclass order family source_url retrieved_at
#> <chr> <chr> <chr> <chr> <chr> <chr> <dttm>
#> 1 Carcharhinus … Galeomorp… Elasmob… Carc… Carch… https://w… 2026-05-25 12:00:00
R/sp_traits.R
sp_traits(species, cache = TRUE)
Public API for traits.
-
One species: calls
fetch_species_traits(). -
Many species:
purrr::map_dfr()with columnspecies_input.
fetch_species_traits(species, cache = TRUE) —
internal
Orchestrates the full trait pipeline documented above.
# Equivalent to fetch_species_traits() without HTTP:
traits_clean <- sharkipediaR:::validate_traits(traits_clean)
nrow(traits_clean)
#> [1] 39
R/sp_trends.R
fetch_species_trends(species, cache = TRUE) —
internal
Same pattern as traits: fetch → parse → clean → validate.
nrow(trends_clean)
#> [1] 321
R/sp_references.R
sp_references(species, cache = TRUE)
Fetches species page, runs parse_references(), adds
species, source_url,
retrieved_at.
Use to build a reference lookup table before joining to traits/trends.
R/example_data.R
example_carcharhinus() — exported
Loads inst/extdata/carcharhinus_acronotus.rds — a list
with pre-parsed components used in all vignettes.
Rebuild after fixture updates:
source("data-raw/build-vignette-data.R")End-to-end manual pipeline (advanced users)
Reproduce sp_traits() without calling it:
url <- sharkipediaR:::species_name_to_url("Carcharhinus acronotus")
# doc <- fetch_page(url) # live
# raw <- parse_traits_tables(doc)
# out <- validate_traits(clean_traits(raw, "Carcharhinus acronotus", url))
# Identical structure to example_carcharhinus()$traitsThis is the extension point if Sharkipedia adds a JSON API later:
swap fetch_page() + parse_*() implementations
while keeping sp_traits() stable.
Testing and pkgdown
-
Unit tests use HTML fixtures in
tests/testthat/fixtures/(no network). -
Vignettes use
example_carcharhinus()for plots and tables. -
pkgdown picks up vignettes automatically after
pkgdown::build_site().
See Getting started and Ecological workflows for ggplot demonstrations and scientific motivation.