gwgen.utils module

Classes

TaskBase(stations, config, project_config, ...) Abstract base class for parameterization and evaluation tasks
TaskConfig(setup_from, to_csv, to_db, ...) The configuration for TaskBase instances
TaskManager([base_task, tasks, config]) A manager to run the tasks within a task framework
TaskMeta Meta class for the TaskBase

Functions

append_doc(namedtuple_cls, doc)
default_config([setup_from, to_csv, to_db, ...]) The default configuration for TaskBase instances.
dir_contains(dirname, path[, exists]) Check if a file or directory is contained in another.
download_file(url[, target]) Download a file from the internet
enhanced_config(config_cls, name)
file_len(fname) Get the number of lines in fname
get_module_path(mod) Convenience method to get the directory of a given python module
get_next_name(old[, fmt]) Return the next name that numerically follows old
get_postgres_engine(database[, user, host, ...]) Get the engine to access the given database
get_toplevel_module(mod)
go_through_dict(key, d[, setdefault]) Split up the key by .
init_interprocess_locks(db_locks, ...)
init_locks(db_locks, file_locks)
isstring(s)
ordered_move(d, to_move, pos) Move a key in an ordered dictionary to another position
safe_csv_append(df, path, *args, **kwargs) Convenience method to dump a data frame to csv without removing the old
str_ranges(s) Convert a string of comma separated values to an iterable
unique_everseen(iterable[, key]) List unique elements, preserving order.
class gwgen.utils.TaskBase(stations, config, project_config, global_config, data=None, requirements=None, *args, **kwargs)[source]

Bases: object

Abstract base class for parameterization and evaluation tasks

Abstract base class that introduces the methods for the parameterization and evaluation framework. The name of the task is specified in the name attribute. You can declare connections to other tasks (within the same framework) in the setup_requires attribute. The instances corresponding to the identifiers in the setup_requires attribute can later be accessed through attributes of the same name.

Examples

Let’s define a parameterizer that does nothing but requires another parameterization task named cloud via its setup_requires attribute:

>>> class CloudParameterizer(Parameterizer):
...     name = 'cloud'
...     def setup_from_scratch(self):
...         pass
...
>>> class DummyParameterizer(Parameterizer):
...     setup_requires = ['cloud']
...     name = 'dummy'
...     def setup_from_scratch(self):
...         pass
...
>>> cloud = CloudParameterizer()
>>> dummy = DummyParameterizer(cloud=cloud)
>>> dummy.cloud is cloud
True

Attributes

cloud_dir str. Path to the directory where the processed parameterization
data pandas.DataFrame. The dataframe holding the daily data
data_dir str. Path to the directory where the source data of the project
datafile str. The path to the csv file where the data is stored by the
dbname The database name to use
default_config The default configuration of this task inserted with the
df_ref The reference data frame
engine The sqlalchemy engine to access the database
eval_dir str. Path to the directory where the processed evaluation data is
fmt dict. Formatoptions to use when making plots with this task
has_run bool. Boolean that is True if there is a run method for this task
input_dir str. Path to the directory where the input data is stored
input_path The path to the project input file in the configuration
logger The logger of this task
name str. name of the task
nc_file NetCDF file for the project
output_dir str. Path to the directory where the output data is stored
output_path The path to the project output file in the configuration
param_dir str. Path to the directory where the processed parameterization
pdf_file PDF file with the figures of the project
project_file Pickle file for the project
reference_path The path to the reference file in the configuration
sa_dir str. Path to the directory where the processed sensitivity analysis
setup_from
setup_parallel bool. Boolean that is True if the task can be setup in parallel
setup_requires list of str. identifiers of required classes for this task
sql_dtypes The data types to write the data into a postgres database
summary str. summary of what this task does
task_data_dir The directory where to store data
threads threading.Thread objects that are started during the setup.

Methods

create_project(ds) To be reimplemented for each task with has_run
from_organizer(organizer, stations, *args, ...) Create a new instance from a model_organization.ModelOrganizer
from_task(task, *args, **kwargs) Create a new instance from another task
get_manager(*args, **kwargs) Return a manager of this class that can be used to setup and organize
get_run_kws(kwargs)
init_from_db() Initialize the task from datatables already created
init_from_file() Initialize the task from already stored files
init_from_scratch() Initialize the task from the configuration settings
init_task() Method that is called on the I/O-Processor to initialize the setup
make_run_config(sp, info) Method to be reimplemented for each task with has_run
plot_additionals(pdf) Method to be reimplemented to make additional plots (if necessary)
run(info, *args, **kwargs) Run the task
set_requirements(requirements) Set the requirements for this task
setup() Set up the database for this task
setup_from_db(**kwargs) Set up the task from datatables already created
setup_from_file(**kwargs) Set up the task from already stored files
setup_from_instances(base, instances[, copy]) Combine multiple task instances into one instance
setup_from_scratch() Setup the data from the configuration settings
write2db(**kwargs) Write the data from this task to the database given by the
write2file(**kwargs) Write the database to the datafile file
Parameters:
  • stations (list) – The list of stations to process
  • config (dict) – The configuration of the experiment
  • project_config (dict) – The configuration of the underlying project
  • global_config (dict) – The global configuration
  • data (pandas.DataFrame) – The data to use. If None, use the setup() method
  • requirements (list of TaskBase instances) – The required instances. If None, you must call the set_requirements() method later
Other Parameters:
  • *args, **kwargs – The configuration of the task. See the TaskConfig for arguments. Note that if you provide *args, you have to provide all possible arguments

cloud_dir

str. Path to the directory where the processed parameterization data is stored

create_project(ds)[source]

To be reimplemented for each task with has_run

Parameters:ds (xarray.Dataset) – The dataset to plot
data = None

pandas.DataFrame. The dataframe holding the daily data

data_dir

str. Path to the directory where the source data of the project is located

datafile

str. The path to the csv file where the data is stored by the Parameterizer.write2file() method and read by the Parameterizer.setup_from_file()

dbname = ''

The database name to use

default_config

The default configuration of this task inserted with the pdf_file, nc_file and project_file attributes

df_ref

The reference data frame

engine

The sqlalchemy engine to access the database

eval_dir

str. Path to the directory where the processed evaluation data is stored

fmt = {}

dict. Formatoptions to use when making plots with this task

classmethod from_organizer(organizer, stations, *args, **kwargs)[source]

Create a new instance from a model_organization.ModelOrganizer

Parameters:
  • organizer (model_organization.ModelOrganizer) – The organizer to use the configuration from
  • stations (list) – The list of stations to process
Other Parameters:
  • *args, **kwargs – The configuration of the task. See the TaskConfig for arguments. Note that if you provide *args, you have to provide all possible arguments

Returns:

An instance of the calling class

Return type:

TaskBase

classmethod from_task(task, *args, **kwargs)[source]

Create a new instance from another task

Parameters:
  • task (TaskBase) – The task to use the configuration from. Note that it can also be of a different type than this class
  • data (pandas.DataFrame) – The data to use. If None, use the setup() method
  • requirements (list of TaskBase instances) – The required instances. If None, you must call the set_requirements() method later
Other Parameters:
  • *args, **kwargs – The configuration of the task. See the TaskConfig for arguments. Note that if you provide *args, you have to provide all possible arguments

See also

setup_from_instances()
To combine multiple instances of the class

Notes

Except for the skip_filtering parameter, the task_config is not inherited from task

classmethod get_manager(*args, **kwargs)[source]

Return a manager of this class that can be used to setup and organize tasks

get_run_kws(kwargs)[source]
has_run = False

bool. Boolean that is True if there is a run method for this task

init_from_db()[source]

Initialize the task from datatables already created

init_from_file()[source]

Initialize the task from already stored files

init_from_scratch()[source]

Initialize the task from the configuration settings

init_task()[source]

Method that is called on the I/O-Processor to initialize the setup

input_dir

str. Path to the directory where the input data is stored

input_path

The path to the project input file in the configuration

logger

The logger of this task

make_run_config(sp, info)[source]

Method to be reimplemented for each task with has_run to manipulate the configuration

Parameters:
  • sp (psyplot.project.Project) – The project of the data
  • info (dict) – The dictionary for saving additional information of the task
name = None

str. name of the task

nc_file

NetCDF file for the project

output_dir

str. Path to the directory where the output data is stored

output_path

The path to the project output file in the configuration

param_dir

str. Path to the directory where the processed parameterization data is stored

pdf_file

PDF file with the figures of the project

plot_additionals(pdf)[source]

Method to be reimplemented to make additional plots (if necessary)

Parameters:pdf (matplotlib.backends.backend_pdf.PdfPages) – The PdfPages instance which can be used to save the figure
project_file

Pickle file for the project

reference_path

The path to the reference file in the configuration

run(info, *args, **kwargs)[source]

Run the task

This method uses the data that has been setup through the setup() method to process some configuration

Parameters:
  • dict – The dictionary with the configuration settings for the namelist
  • dict – The dictionary holding additional meta information
sa_dir

str. Path to the directory where the processed sensitivity analysis data is stored

set_requirements(requirements)[source]

Set the requirements for this task

Parameters:requirements (list of TaskBase instances) – The tasks as specified in the setup_requires attribute
setup()[source]

Set up the database for this task

setup_from
setup_from_db(**kwargs)[source]

Set up the task from datatables already created

setup_from_file(**kwargs)[source]

Set up the task from already stored files

classmethod setup_from_instances(base, instances, copy=False)[source]

Combine multiple task instances into one instance

Parameters:
  • base (TaskBase) – The base task to use the configuration from
  • instances (list of TaskBase) – The tasks containing the data
  • copy (bool) – If True, a copy of base is returned, otherwise base is modified inplace
setup_from_scratch()[source]

Setup the data from the configuration settings

setup_parallel = True

bool. Boolean that is True if the task can be setup in parallel

setup_requires = []

list of str. identifiers of required classes for this task

sql_dtypes

The data types to write the data into a postgres database

summary = ''

str. summary of what this task does

task_data_dir

The directory where to store data

threads = []

threading.Thread objects that are started during the setup. The setup waits for them to finish before continuing with another process

write2db(**kwargs)[source]

Write the data from this task to the database given by the engine attribute

write2file(**kwargs)[source]

Write the database to the datafile file

class gwgen.utils.TaskConfig(setup_from, to_csv, to_db, remove, skip_filtering, plot_output, nc_output, project_output, new_project, project, close)

Bases: gwgen.utils.TaskConfig

Parameters:
  • setup_from ({ 'scratch' | 'file' | 'db' | None }) –

    The method to use for setting up the instance, either from

    'scratch'
    To set up the task from the raw data
    'file'
    Set up the task from an existing file
    'db'
    Set up the task from a database
    None
    If the data file of this task exists, use it; otherwise, if a database is provided, use that; otherwise set up from scratch
  • to_csv (bool) – If True, the data at setup will be written to a csv file
  • to_db (bool) – If True, the data at setup will be written into a database
  • remove (bool) – If True and the old data file already exists, remove it before writing
  • skip_filtering (bool) – If True, skip the filtering for the correct stations in the datafile
  • plot_output (str) – An alternative path to use for the PDF file of the plot
  • nc_output (str) – An alternative path (or multiples depending on the task) to use for the netCDF file of the plot data
  • project_output (str) – An alternative path to use for the psyplot project file of the plot
  • new_project (bool) – If True, a new project will be created even if a file in project_output exists already
  • project (str) – The path to a psyplot project file to use for this parameterization
  • close (bool) – Close the project at the end
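The fallback behavior for setup_from=None can be sketched with a hypothetical helper (resolve_setup_from is not part of gwgen; it only illustrates the documented precedence):

```python
def resolve_setup_from(datafile_exists, engine):
    """Illustrate the ``setup_from=None`` fallback: prefer an existing
    data file, then a configured database engine, then scratch."""
    if datafile_exists:
        return 'file'
    if engine is not None:
        return 'db'
    return 'scratch'
```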
class gwgen.utils.TaskManager(base_task=<class 'gwgen.utils.TaskBase'>, tasks=None, config={})[source]

Bases: object

A manager to run the tasks within a task framework

Parameters:
  • base_task (TaskBase) – A subclass of the TaskBase class whose tasks shall be used within this manager.
  • tasks (list of TaskBase instances) – The initialized tasks to use. If None, you need to call the initialize_tasks() method
  • config (dict) – The configuration of this manager containing information about the multiprocessing

Attributes

base_task A subclass of the TaskBase class whose
logger The logger of this task

Methods

get_requirements(identifier[, all_requirements]) Return the required task classes for this task
get_task(identifier) Return the task in this manager corresponding to identifier
get_task_cls(identifier) Return the task class corresponding to the given identifier
initialize_tasks(stations[, task_kws]) Initialize the setup of the tasks
run(full_info, *args)
setup(stations[, to_return]) Setup the data for the tasks in parallel or serial
sort_by_requirement(objects) Sort the given tasks by their logical order
base_task = None

A subclass of the TaskBase class whose TaskBase._registry attribute shall be used

get_requirements(identifier, all_requirements=True)[source]

Return the required task classes for this task

Parameters:
  • identifier (str) – The name attribute of the Parameterizer subclass
  • all_requirements (bool) – If True, all requirements are searched recursively. Otherwise only the direct requirements are returned
Returns:

A list of Parameterizer subclasses that are required for the task of the given identifier

Return type:

list of Parameterizer

get_task(identifier)[source]

Return the task in this manager that corresponds to the given identifier

Parameters:identifier (str) – The name attribute of the TaskBase subclass
Returns:The requested task
Return type:TaskBase
get_task_cls(identifier)[source]

Return the task class corresponding to the given identifier

Parameters:identifier (str) – The name attribute of the TaskBase subclass
Returns:The class of the requested task
Return type:TaskBase
initialize_tasks(stations, task_kws={})[source]

Initialize the setup of the tasks

This classmethod uses the TaskBase framework to initialize the setup on the I/O-processor

Parameters:
  • stations (list) – The list of stations to process
  • task_kws (dict) – Keywords must be valid identifiers of the TaskBase instances; the values are dictionaries of keyword arguments for their setup() method
logger

The logger of this task

run(full_info, *args)[source]
setup(stations, to_return=None)[source]

Setup the data for the tasks in parallel or serial

Parameters:
  • stations (list of str) – The stations to process
  • to_return (list of str) – The names of the tasks to return. If None, all tasks that have a run method will be returned
static sort_by_requirement(objects)[source]

Sort the given tasks by their logical order

Parameters:objects (list of TaskBase subclasses or instances) – The objects to sort
Returns:The same as objects but sorted
Return type:list of TaskBase subclasses or instances
class gwgen.utils.TaskMeta[source]

Bases: abc.ABCMeta

Meta class for the TaskBase

gwgen.utils.append_doc(namedtuple_cls, doc)[source]
gwgen.utils.default_config(setup_from=None, to_csv=False, to_db=False, remove=False, skip_filtering=False, plot_output=None, nc_output=None, project_output=None, new_project=False, project=None, close=True)[source]

The default configuration for TaskBase instances. See also the TaskBase.default_config attribute

Parameters:
  • setup_from ({ 'scratch' | 'file' | 'db' | None }) –

    The method to use for setting up the instance, either from

    'scratch'
    To set up the task from the raw data
    'file'
    Set up the task from an existing file
    'db'
    Set up the task from a database
    None
    If the data file of this task exists, use it; otherwise, if a database is provided, use that; otherwise set up from scratch
  • to_csv (bool) – If True, the data at setup will be written to a csv file
  • to_db (bool) – If True, the data at setup will be written into a database
  • remove (bool) – If True and the old data file already exists, remove it before writing
  • skip_filtering (bool) – If True, skip the filtering for the correct stations in the datafile
  • plot_output (str) – An alternative path to use for the PDF file of the plot
  • nc_output (str) – An alternative path (or multiples depending on the task) to use for the netCDF file of the plot data
  • project_output (str) – An alternative path to use for the psyplot project file of the plot
  • new_project (bool) – If True, a new project will be created even if a file in project_output exists already
  • project (str) – The path to a psyplot project file to use for this parameterization
  • close (bool) – Close the project at the end
gwgen.utils.dir_contains(dirname, path, exists=True)[source]

Check if a file or directory is contained in another.

Parameters:
  • dirname (str) – The base directory that should contain path
  • path (str) – The name of a directory or file that should be in dirname
  • exists (bool) – If True, the path and dirname must exist

Notes

path and dirname must be either both absolute or both relative paths
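A minimal sketch of such a containment check, assuming plain os.path semantics (the actual implementation may differ):

```python
import os.path

def dir_contains(dirname, path, exists=True):
    """Check whether `path` lies inside `dirname` by comparing
    their common path prefix."""
    if exists:  # resolve symlinks and relative segments first
        dirname = os.path.realpath(dirname)
        path = os.path.realpath(path)
    return os.path.commonpath([dirname, path]) == os.path.normpath(dirname)
```

Note that os.path.commonpath raises ValueError when mixing absolute and relative paths, which matches the note above.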

gwgen.utils.download_file(url, target=None)[source]

Download a file from the internet

Parameters:
  • url (str) – The url of the file
  • target (str or None) – The path where the downloaded file shall be saved. If None, it will be saved to a temporary directory
Returns:

file_name – the downloaded filename

Return type:

str
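The behavior can be sketched with urllib from the standard library (a sketch, not the actual implementation):

```python
import os
import tempfile
from urllib.request import urlretrieve

def download_file(url, target=None):
    """Download `url` to `target`; when no target is given, save the
    file into a fresh temporary directory. Returns the file path."""
    if target is None:
        target = os.path.join(tempfile.mkdtemp(), os.path.basename(url))
    return urlretrieve(url, target)[0]
```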

gwgen.utils.enhanced_config(config_cls, name)[source]
gwgen.utils.file_len(fname)[source]

Get the number of lines in fname

gwgen.utils.get_module_path(mod)[source]

Convenience method to get the directory of a given python module

gwgen.utils.get_next_name(old, fmt='%i')[source]

Return the next name that numerically follows old
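One way to implement this (a sketch; the real function may behave differently when `old` contains no trailing number):

```python
import re

def get_next_name(old, fmt='%i'):
    """Increment a trailing integer in `old`, or append ``fmt % 1``
    if there is none."""
    match = re.search(r'\d+$', old)
    if match:
        return old[:match.start()] + fmt % (int(match.group()) + 1)
    return old + fmt % 1
```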

gwgen.utils.get_postgres_engine(database, user=None, host='127.0.0.1', port=None, create=False, test=False)[source]

Get the engine to access the given database

This method creates an engine using sqlalchemy’s create_engine function to access the given database via postgresql. If the database is not existent, it will be created

Parameters:
  • database (str) – The name of a psql database. If provided, the processed data will be stored
  • user (str) – The username to use when logging into the database
  • host (str) – the host which runs the database server
  • port (int) – The port to use to log into the database
  • create (bool) – If True, try to create the database as the postgres user if it does not exist
  • test (bool) – If True, test the connection before returning the engine
Returns:

The engine to access the database

Return type:

sqlalchemy.engine.base.Engine

Notes

The engine is for single usage!
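The connection URL that such a function would hand to sqlalchemy's create_engine can be sketched as follows (postgres_url is a hypothetical helper, not part of gwgen):

```python
def postgres_url(database, user=None, host='127.0.0.1', port=None):
    """Build a postgresql connection URL from the given parts."""
    netloc = '%s@%s' % (user, host) if user else host
    if port:
        netloc = '%s:%s' % (netloc, port)
    return 'postgresql://%s/%s' % (netloc, database)
```

The engine itself would then be obtained via sqlalchemy.create_engine(postgres_url(...)).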

gwgen.utils.get_toplevel_module(mod)[source]
gwgen.utils.go_through_dict(key, d, setdefault=None)[source]

Split up the key by . and get the value from the base dictionary d

Parameters:
  • key (str) – The key in the configuration. If the key goes several levels deep, the levels may be separated by a '.' (e.g. 'namelists.weathergen'). Hence, to insert a literal '.', it must be escaped by a preceding '\'.
  • d (dict) – The configuration dictionary containing the key
  • setdefault (callable) – If not None and an item is not existent in d, it is created by calling the given function
Returns:

  • str – The last level of the key
  • dict – The dictionary in d that contains the last level of the key
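Ignoring the escaping of '.', the traversal can be sketched as:

```python
def go_through_dict(key, d, setdefault=None):
    """Walk `d` along the '.'-separated `key` and return the last key
    level together with the dictionary that contains it. This sketch
    ignores the escaping of '.' described above."""
    *parents, last = key.split('.')
    for part in parents:
        if setdefault is not None:
            # create missing intermediate levels on the fly
            d = d.setdefault(part, setdefault())
        else:
            d = d[part]
    return last, d
```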

gwgen.utils.init_interprocess_locks(db_locks, file_locks, lock_dir)[source]
gwgen.utils.init_locks(db_locks, file_locks)[source]
gwgen.utils.isstring(s)[source]
gwgen.utils.ordered_move(d, to_move, pos)[source]

Move a key in an ordered dictionary to another position

Parameters:
  • d (collections.OrderedDict) – The dictionary containing the keys
  • to_move (str) – The key to move
  • pos (str) – The name of the key that should be followed by to_move
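One reading of this (a sketch that places to_move directly after pos; the actual implementation may differ):

```python
from collections import OrderedDict

def ordered_move(d, to_move, pos):
    """Rebuild `d` in place so that `to_move` directly follows `pos`."""
    value = d.pop(to_move)
    items = list(d.items())
    d.clear()
    for k, v in items:
        d[k] = v
        if k == pos:  # re-insert the moved key right after `pos`
            d[to_move] = value
```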
gwgen.utils.safe_csv_append(df, path, *args, **kwargs)[source]

Convenience method to dump a data frame to csv without removing the old

This function dumps the given df to the file specified by path. If path already exists, we read the header of the file and sort df according to this header

Parameters:
  • df (pandas.DataFrame) – The data frame to dump
  • path (str) – The path to the csv file to write to
gwgen.utils.str_ranges(s)[source]

Convert a string of comma separated values to an iterable

Parameters:s (str) – A comma (',') separated string. A single value in this string represents one number; ranges can also be given via a separation by '-'. Hence, '2009,2012-2015' will be converted to [2009, 2012, 2013, 2014] and '2009,2012-2015-2' to [2009, 2012, 2015]
Returns:The values in s converted to a list
Return type:list
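A sketch consistent with the first documented example (exclusive upper bound; the handling of a third '-'-separated value as a step is an assumption):

```python
def str_ranges(s):
    """Convert '2009,2012-2015' style strings to a list of ints."""
    result = []
    for part in s.split(','):
        nums = list(map(int, part.split('-')))
        if len(nums) == 1:
            result.extend(nums)          # single value
        else:
            result.extend(range(*nums))  # start-stop or start-stop-step
    return result
```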
gwgen.utils.unique_everseen(iterable, key=None)[source]

List unique elements, preserving order. Remember all elements ever seen.

Function taken from https://docs.python.org/2/library/itertools.html
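The recipe from the itertools documentation (Python 3 form) reads:

```python
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    """Yield unique elements, preserving order; remember all seen."""
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element
```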