Google Cloud
GCP for Archive and Analysis
National PAM SI Cloud Repo: more details on cloud storage and processing from the cloud team
Link to SWFSC GCP Bucket: our data storage site
Accessing Data
Check the archive spreadsheet to see if the dataset you are interested in is available in the SWFSC-SAEL bucket. Note: only the raw audio data is uploaded here; see our NCEI archive for additional ancillary data.
The bucket is organized by platform -> project -> deployment. You can select all data within a platform, or any subfolder within that directory, and click Download. You can also download from the command line, as shown below.
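If you have the Cloud SDK installed (set-up instructions below), a minimal download sketch using the ADRIFT example path shown later in this section, with a placeholder local destination:
gsutil -m cp -r gs://swfsc-1/drifting_recorder/CalCurCEAS_2024 [local destination folder/]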
Cloud Storage
Manual Upload
- Open the SWFSC bucket in the GCP console
- Select Upload (Upload files or Upload folder) and navigate to the data you want to upload
- Click “Upload”
Large files or datasets may cause this browser-based upload to fail. If you would like to upload large amounts of data, use the SDK/gsutil options below.
SDK/gsutil Set-Up
- Download the Google Cloud CLI installer and install Cloud SDK on your computer
- Open the ‘Google Cloud SDK Shell Terminal’, follow the prompts, and select ‘ggn-nmfs-pamdata-prod-1’ as your project (the equivalent manual commands are shown below)
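If the installer prompts are skipped, or you need to re-authenticate later, the same set-up can be done with standard gcloud commands; a minimal sketch:
gcloud auth login
gcloud config set project ggn-nmfs-pamdata-prod-1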
SDK/gsutil Upload
- Open the ‘Google Cloud SDK Shell Terminal’
- Enter the following command to upload files to GCP:
gsutil -m cp -r [source pathway to files to be uploaded/] gs://[destination pathway to folder on bucket/]
- SAEL ADRIFT Example:
gsutil -m cp -r E:\DATA\LEG_2\RECORDINGS\CalCurCEAS_004 gs://swfsc-1/drifting_recorder/CalCurCEAS_2024
- The upload should start; wait for the terminal prompt to return, then refresh the SWFSC bucket website to see your uploaded data (or verify from the terminal, as shown below)
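To verify an upload from the terminal instead of the browser, you can list the destination folder; a sketch using the example path above:
gsutil ls -r gs://swfsc-1/drifting_recorder/CalCurCEAS_2024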
Automated SDK/gsutil Upload
You can use the following R script to run the above command automatically and bulk-upload data from a source directory. R script available here. By default, the script only uploads from 6 pm to 6 am on weekdays, and at any time on weekends, to avoid excess network traffic during business hours.
# ----FUNCTIONS----
# Short recursive space-to-underscore renamer
rename_spaces <- function(path = ".") {
  # Get all files and folders, deepest first
  all_paths <- list.files(path, recursive = TRUE, full.names = TRUE, include.dirs = TRUE)
  all_paths <- all_paths[order(nchar(all_paths), decreasing = TRUE)]

  # Rename each item that contains spaces
  for (p in all_paths) {
    if (grepl(" ", basename(p))) {
      new_path <- file.path(dirname(p), gsub(" ", "_", basename(p)))
      file.rename(p, new_path)
      cat("Renamed:", basename(p), "->", basename(new_path), "\n")
    }
  }
}

# Helpers to check for weekdays and after-hours; uploads pause between 6 AM and 6 PM on weekdays
is_weekday <- function(x){
  weekday_names <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
  return(weekdays(x) %in% weekday_names)
}

is_after_hours <- function(x){
  # TRUE before 6 AM or from 6 PM onward
  return(as.integer(format(x, "%H")) < 6 | as.integer(format(x, "%H")) >= 18)
}

# Copy function
copy_files_to_GCP <- function(Local_dir, Cloud_dir, file_type = NULL, log_file, offHoursCopy = TRUE) {
  ## List files ####
  # combine all file types into one regex (or expression); NULL becomes "" and matches everything
  file_type <- paste(file_type, collapse = "|")
  good_files <- list.files(Local_dir, pattern = file_type, full.names = TRUE, recursive = TRUE)

  ## Copy files ####
  for (file in good_files) {
    # check for after hours; will run any time if offHoursCopy = FALSE
    if (offHoursCopy) {
      while (is_weekday(Sys.time()) & !is_after_hours(Sys.time())) {
        cat(paste('waiting...', Sys.time(), "\n"))
        Sys.sleep(600) # wait 10 minutes before trying again
      }
    }

    # build a relative GCP path to retain the local folder structure
    rel_path <- gsub(paste0("^", Local_dir, "/?"), "", file)
    GCP_path <- paste0("gs://", Cloud_dir, "/", rel_path)

    # create gsutil command (-c continues on errors, -L logs each transfer)
    cmd <- paste0("gsutil -m cp -c -L \"", log_file, "\" \"", file, "\" \"", GCP_path, "\"")

    # pass command to the system and print a progress message in the R console
    message("Uploading: ", file, " -> ", GCP_path)
    system(cmd)
  }

  ## Create Log ####
  # read in the log file created by gsutil
  log_df <- read.csv(log_file, header = TRUE)

  # subset files that successfully transferred and those that failed
  successful_log <- subset(log_df, Result == "OK")
  failed_log <- subset(log_df, Result == "ERROR" | Result == "SKIPPED")

  # number of files that successfully copied and failed
  N_files_copied <- nrow(successful_log)
  N_files_failed <- nrow(failed_log)

  # number of folders copied
  successful_log$Folder <- dirname(successful_log$Source)
  num_folders_copied <- length(unique(successful_log$Folder))

  # total local size and total cloud size
  total_local_size <- sum(as.numeric(successful_log$Source.Size), na.rm = TRUE)
  total_bytes_transferred <- sum(as.numeric(successful_log$Bytes.Transferred), na.rm = TRUE)

  # create summary dataframe
  summary_df <- data.frame(Files_Copied = N_files_copied,
                           Files_Failed = N_files_failed,
                           Folders_Copied = num_folders_copied,
                           Total_Local_Size_bytes = total_local_size,
                           Total_Cloud_Size_bytes = total_bytes_transferred)

  xlsx_file <- sub("\\.txt$", ".xlsx", log_file)

  # Create a new workbook
  wb <- createWorkbook()

  # Add Summary sheet
  addWorksheet(wb, "Summary")
  writeData(wb, "Summary", summary_df)

  # Add Success Files sheet
  addWorksheet(wb, "Completed Files")
  writeData(wb, "Completed Files", successful_log)

  # Add Failed Files sheet
  addWorksheet(wb, "Failed Files")
  writeData(wb, "Failed Files", failed_log)

  # Save the workbook
  saveWorkbook(wb, xlsx_file, overwrite = TRUE)
}

# ----EXAMPLE WORKFLOW----
## ----STEP 1: Define parameters----
library(openxlsx) # needed for the Excel summary log

# Path to the local directory you want to copy FROM
Local_dir <- "C:/Users/kourtney.burger/Desktop/test"

# Path to the cloud directory you want to copy TO
Cloud_dir <- "swfsc-1/drifting_recorder/kbtest"

# Log file name with path (used with the gsutil -L flag). Leave as is unless you want to
# change where the log file is saved; this default writes it to your current working directory
log_file <- paste0(getwd(), "/GCP_transfer_log_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".txt")

# Copy off hours? Default is TRUE = copy only on nights/weekends
offHoursCopy <- FALSE

# Copy all files? Choose file types if you do not want to copy everything
file_type <- NULL
# file_type <- c("\\.wav$", "\\.log.xml$", "\\.accel.csv$", "\\.temp.csv$", "\\.ltsa$", "\\.flac$")

## ----STEP 2: (Optional) Rename files/folders to remove spaces----
# Spaces in your file/folder names may lead to downstream processing issues.
# We recommend running this to replace any spaces in your ORIGINAL file/folder names.
# Risk: if you have documents or code that require the exact names (with spaces), you may have issues.
# DO NOT RUN THIS LINE IF YOU NEED TO RETAIN SPACES IN YOUR FILE NAMES; the other functions will still work.
rename_spaces(Local_dir)

## ----STEP 3: Copy files to GCP----
# Progress is printed to the R console; check the output xlsx file for errors
copy_files_to_GCP(
  Local_dir = Local_dir,
  Cloud_dir = Cloud_dir,
  file_type = file_type,
  log_file = log_file,
  offHoursCopy = offHoursCopy
)

# --------------Final Notes:--------------
# - You must have gsutil installed and authenticated on your system (see the quick checks below)
# - Make sure the GCP bucket exists and you have write permissions
# - The output Excel log is saved to your working directory alongside the log txt file, summarizing successes and failures
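A quick pre-flight check for the notes above, using standard gcloud/gsutil commands and the bucket name from the examples in this section:
gcloud auth list          # confirms an active authenticated account
gsutil ls -b gs://swfsc-1 # fails if the bucket does not exist or you lack access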
More information on Cloud Storage
See the PAM SI Cloud Storage website for more information and troubleshooting tips
Cloud Processing
Work in progress
Cloud-based processing is now available through Google Cloud Workstations, using the data you upload to GCP. SAEL has not yet begun processing in the cloud, but you can follow the steps below to set up a Windows virtual machine (workstation) and access the existing cloud-based software.
PAM Windows Workstation
Follow the detailed directions here to set up a Windows workstation.
More Information on Cloud Processing
See the PAM SI Cloud Processing website for more information and troubleshooting tips