API documentation
This documentation is based on the source code of version 1.0.1 of the pdiffcopy package. The following modules are available:
pdiffcopy
Configuration defaults for the pdiffcopy program.
pdiffcopy.BLOCK_SIZE = 1048576
The default block size to be used by pdiffcopy (1 MiB).
pdiffcopy.DEFAULT_CONCURRENCY = 2
The default concurrency to be used by pdiffcopy (at least two, at most 1/3 of available cores).
pdiffcopy.DEFAULT_PORT = 8080
The default port number for the pdiffcopy server (an integer number, defaults to 8080).
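The description of DEFAULT_CONCURRENCY suggests that the value depends on the number of CPU cores. The following is a minimal sketch of one way such a default could be computed, assuming the rule is "one third of the cores, but never fewer than two". The helper name default_concurrency is hypothetical and the package may compute the value differently.

```python
import multiprocessing


def default_concurrency(cpu_count=None):
    """Hypothetical helper: a third of the cores, but never fewer than two."""
    if cpu_count is None:
        cpu_count = multiprocessing.cpu_count()
    return max(2, cpu_count // 3)


print(default_concurrency(4))   # small machine: falls back to the minimum of two
print(default_concurrency(24))  # larger machine: one third of the cores
```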
pdiffcopy.cli
Usage: pdiffcopy [OPTIONS] [SOURCE, TARGET]
Synchronize large binary data files between Linux servers at blazing speeds by performing delta transfers and spreading the work over many CPU cores.
One of the SOURCE and TARGET arguments is expected to be the pathname of a local file and the other argument is expected to be a URL that provides the location of a remote pdiffcopy server and a remote filename. File data will be read from SOURCE and written to TARGET.
If no positional arguments are given the server is started.
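To illustrate the distinction between local pathnames and remote locations described above, here is a minimal sketch that splits a location expression into a host name, a port number and a filename. The function name parse_location and the use of an http:// URL are assumptions made for illustration; the exact URL syntax accepted by pdiffcopy is not specified in this section.

```python
from urllib.parse import urlparse

DEFAULT_PORT = 8080  # mirrors the documented pdiffcopy.DEFAULT_PORT


def parse_location(expression):
    """Hypothetical helper: split a location into (hostname, port, filename)."""
    parsed = urlparse(expression)
    if parsed.scheme and parsed.hostname:
        # Remote location: the URL names a server and a file on that server.
        return parsed.hostname, parsed.port or DEFAULT_PORT, parsed.path
    # Local location: a plain pathname without server details.
    return None, None, expression


print(parse_location("http://backup.example.com:9000/data/disk.img"))
print(parse_location("/data/disk.img"))
```

In this sketch a local file is recognizable by the absence of a host name, which matches the documented hostname and port_number properties of Location being None for such files.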
Supported options:
| Option | Description |
|---|---|
| `-b`, `--block-size=BYTES` | Customize the block size of the delta transfer. Can be a plain integer number (bytes) or an expression like 5K, 1MiB, etc. |
| `-m`, `--hash-method=NAME` | Customize the hash method of the delta transfer (defaults to ‘sha1’ but supports all hash methods provided by the Python hashlib module). |
| `-W`, `--whole-file` | Disable the delta transfer algorithm (skips hash computation and downloads all blocks unconditionally). |
| `-c`, `--concurrency=COUNT` | Change the number of parallel block hash / copy operations. |
| `-n`, `--dry-run` | Scan for differences between the source and target file and report the similarity index, but don’t write any changed blocks to the target. |
| `-B`, `--benchmark=COUNT` | Evaluate the effectiveness of delta transfer by mutating the TARGET file (which must be a local file) and resynchronizing its contents. This process is repeated COUNT times, with varying similarity. At the end an overview is printed. |
| `-l`, `--listen=ADDRESS` | Listen on the specified IP:PORT or PORT. |
| `-v`, `--verbose` | Increase logging verbosity (can be repeated). |
| `-q`, `--quiet` | Decrease logging verbosity (can be repeated). |
| `-h`, `--help` | Show this message and exit. |
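Because the hash method is looked up in the Python hashlib module, any name that hashlib accepts should be usable. The sketch below shows how a name could be validated and turned into a hash object with standard hashlib calls; the helper name make_hasher is hypothetical and this is not necessarily how pdiffcopy validates the option.

```python
import hashlib


def make_hasher(name="sha1"):
    """Return a hashlib object for NAME or raise ValueError if unsupported."""
    if name not in hashlib.algorithms_available:
        raise ValueError("Unsupported hash method: %s" % name)
    return hashlib.new(name)


hasher = make_hasher("sha256")
hasher.update(b"example block")
print(hasher.name, len(hasher.hexdigest()))
```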
pdiffcopy.client
Parallel, differential file copy client.
class pdiffcopy.client.Client(**kw)
Python API for the client side of the pdiffcopy program.
Here’s an overview of the Client class:
Superclass: PropertyManager
Public methods: compute_transfer_size(), find_changes(), mutate_target(), run_benchmark(), synchronize(), synchronize_once() and transfer_changes()
Properties: benchmark, block_size, concurrency, delta_transfer, dry_run, hash_method, source and target
You can set the values of the benchmark, block_size, concurrency, delta_transfer, dry_run, hash_method, source and target properties by passing keyword arguments to the class initializer.
benchmark
How many times the benchmark should be run (an integer, defaults to 0).
Note: The benchmark property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
block_size
The block size used by the client.
Note: The block_size property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
concurrency
The number of parallel processes that the client is allowed to start.
Note: The concurrency property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
delta_transfer
Whether delta transfer is enabled (a boolean, defaults to True).
Note: The delta_transfer property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
dry_run
Whether the client is allowed to make changes.
Note: The dry_run property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
hash_method
The block hash method (a string, defaults to ‘sha1’).
Note: The hash_method property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
source
The Location from which data is read.
Note: The source property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
target
The Location to which data is written.
Note: The target property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
compute_transfer_size(offsets)
Figure out how much data we’re going to transfer.
Parameters: offsets – A list of integers with the offsets of the blocks to be synchronized.
Returns: The amount of data to be transferred in bytes (an integer).
This would be trivially easy if it weren’t for the last block, which can be smaller than the block size. Depending on the configured block size and the size of the file being synchronized, the difference may be negligible or quite significant, so we go to the effort of calculating this correctly.
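The short final block is what makes this calculation more than a multiplication. A minimal sketch of the arithmetic follows, written as a standalone function; passing block_size and file_size as explicit parameters is an assumption made for illustration (the documented method only takes offsets), and offsets are assumed to lie within the file.

```python
def transfer_size(offsets, block_size, file_size):
    """Sum the sizes of the blocks at OFFSETS, allowing a short final block."""
    total = 0
    for offset in offsets:
        # A block never extends past the end of the file.
        total += min(block_size, file_size - offset)
    return total


MIB = 1024 * 1024
# A 2.5 MiB file split into 1 MiB blocks: the block at offset 2 MiB is only 0.5 MiB.
print(transfer_size([0, 2 * MIB], block_size=MIB, file_size=2 * MIB + MIB // 2))
```

With the values above the result is 1.5 MiB (1572864 bytes) rather than the 2 MiB that a naive "number of blocks times block size" calculation would give.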
synchronize_once()
Synchronize from source to target.
Returns: The number of blocks that differed (an integer).
find_changes()
Helper for synchronize() to compute the similarity index.
transfer_changes(offsets)
Helper for synchronize() to transfer the differences.
Parameters: offsets – A list of integers with the byte offsets of the blocks to copy from source to target.
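The two helpers above correspond to the two phases of a delta transfer: compare block hashes, then copy only the blocks that differ. The following in-memory sketch illustrates that idea under simplifying assumptions: both "files" are byte strings of the same size, everything runs in a single process, and all function names are hypothetical rather than part of the pdiffcopy API.

```python
import hashlib


def block_hashes(data, block_size):
    """Map each block offset to the SHA-1 hex digest of that block."""
    return {
        offset: hashlib.sha1(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    }


def find_changed_offsets(source_hashes, target_hashes):
    """Return the sorted offsets whose hashes differ or are missing in the target."""
    return sorted(
        offset
        for offset, digest in source_hashes.items()
        if target_hashes.get(offset) != digest
    )


def transfer_blocks(source, target, offsets, block_size):
    """Copy the blocks at OFFSETS from source (bytes) into target (bytearray)."""
    for offset in offsets:
        target[offset:offset + block_size] = source[offset:offset + block_size]


source = b"aaaabbbbcccc"
target = bytearray(b"aaaaXXXXcccc")
changed = find_changed_offsets(block_hashes(source, 4), block_hashes(bytes(target), 4))
transfer_blocks(source, target, changed, 4)
print(changed, bytes(target) == source)
```

Only the middle block differs in this example, so a single block is copied instead of the whole file.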
pdiffcopy.client.get_hashes_fn(location, **options)
Adapter for multiprocessing used by Client.find_changes().
pdiffcopy.client.transfer_block_fn(offset, source, target, block_size)
Adapter for multiprocessing used by Client.transfer_changes().
class pdiffcopy.client.Location(**kw)
A local or remote file to be copied.
Here’s an overview of the Location class:
Superclass: PropertyManager
Public methods: get_hashes(), get_url(), read_block(), resize() and write_block()
Properties: exists, expression, file_info, file_size, filename, hostname, label and port_number
You can set the values of the expression, filename, hostname and port_number properties by passing keyword arguments to the class initializer.
exists
True if the file exists, False otherwise.
Note: The exists property is a cached_property. This property’s value is computed once (the first time it is accessed) and the result is cached. To clear the cached value you can use del or delattr().
expression
The location expression (a string).
Note: The expression property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
filename
The absolute pathname of the file to copy (a string).
Note: The filename property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
hostname
The host name of a pdiffcopy server (a string or None).
Note: The hostname property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
label
A human friendly label for the location (a string).
port_number
The port number of a pdiffcopy server (a number or None).
Note: The port_number property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
file_info
A dictionary with file metadata.
Note: The file_info property is a cached_property. This property’s value is computed once (the first time it is accessed) and the result is cached. To clear the cached value you can use del or delattr().
file_size
The size of the file in bytes (an integer).
Note: The file_size property is a cached_property. This property’s value is computed once (the first time it is accessed) and the result is cached. To clear the cached value you can use del or delattr().
get_hashes(**options)
Get the hashes of the blocks in a file.
Parameters: options – See get_url().
Returns: A generator of tokens with two values each:
- A byte offset into the file (an integer).
- The hash of the block starting at that offset (a string).
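To make the shape of the return value concrete, here is a minimal sketch of a generator that yields (offset, hash) pairs for a local file. It reads the file sequentially in a single process, whereas pdiffcopy computes hashes in parallel, and the function name iter_block_hashes and its parameters are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def iter_block_hashes(filename, block_size=1048576, hash_method="sha1"):
    """Yield (offset, hexdigest) pairs for each block in a local file."""
    offset = 0
    with open(filename, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            yield offset, hashlib.new(hash_method, block).hexdigest()
            offset += len(block)


# Demonstrate on a ten byte temporary file with a block size of four bytes.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"0123456789")
try:
    hashes = list(iter_block_hashes(handle.name, block_size=4))
finally:
    os.unlink(handle.name)
print([offset for offset, digest in hashes])
```

The last pair covers the two remaining bytes at offset 8, which is the short final block that compute_transfer_size() has to account for.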
get_url(endpoint, **params)
Get the server URL for the given endpoint.
Parameters:
- endpoint – The name of a server side endpoint (a string).
- params – Any query string parameters.
read_block(offset, size)
Read a block of data from filename.
Parameters:
- offset – The byte offset where reading starts (an integer).
- size – The number of bytes to read (an integer).
Returns: A byte string.
pdiffcopy.exceptions
Custom exceptions raised by the pdiffcopy modules.
exception pdiffcopy.exceptions.ProgramError(text, *args, **kw)
The base exception class for all custom exceptions raised by the pdiffcopy modules.
__init__(text, *args, **kw)
Initialize a ProgramError object.
For argument handling see the compact() function. The resulting string is used as the exception message.
pdiffcopy.hashing
Parallel hashing of files using multiprocessing and pdiffcopy.mp.
pdiffcopy.mp
Adaptations of multiprocessing that make it easier to do the right thing.
This module stands alone as a library used by the other modules that are
specialized to what pdiffcopy does (synchronizing files). I may end up
extracting this to a separate package at some point, because over the 10+ years
that I’ve been programming Python I’ve written an awful lot of plumbing code
for multiprocessing and it’s not exactly my favorite thing in the world
(I suck at reasoning about concurrency, like most people I guess).
class pdiffcopy.mp.Promise(**options)
Execute a Python function in a child process and retrieve its return value.
__init__(**options)
Initialize a Promise object.
The initializer arguments are the same as for multiprocessing.Process. The child process is started automatically.
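The general technique of running a function in a child process and collecting its return value can be sketched with the standard multiprocessing module. The class below is an illustration only: the name SimplePromise, its join() method and the use of a pipe are assumptions, not the implementation of pdiffcopy.mp.Promise. The sketch also assumes the "fork" start method, which is available on Linux.

```python
import multiprocessing

# Assumption: the "fork" start method (available on Linux) is used so that
# the child process inherits the function to call without pickling it.
context = multiprocessing.get_context("fork")


def _run_in_child(connection, target, args, kwargs):
    """Call the target in the child and send the outcome to the parent."""
    try:
        connection.send((True, target(*args, **kwargs)))
    except Exception as e:
        connection.send((False, repr(e)))
    finally:
        connection.close()


class SimplePromise(object):
    """Hypothetical sketch: start a child process right away, collect later."""

    def __init__(self, target, args=(), kwargs=None):
        # The first end of a one-way pipe can only receive, the second only send.
        self.parent_end, child_end = context.Pipe(duplex=False)
        self.process = context.Process(
            target=_run_in_child,
            args=(child_end, target, args, kwargs or {}),
        )
        self.process.start()
        child_end.close()

    def join(self):
        """Wait for the child process and return (or re-raise) its outcome."""
        # Receive before joining so that a large result can't block the child.
        succeeded, value = self.parent_end.recv()
        self.process.join()
        if not succeeded:
            raise RuntimeError(value)
        return value


promise = SimplePromise(target=pow, args=(2, 10))
result = promise.join()
print(result)
```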
class pdiffcopy.mp.WorkerPool(**kw)
Simple to use worker pool implementation using multiprocessing.
Here’s an overview of the WorkerPool class:
Superclass: PropertyManager
Special methods: __enter__(), __exit__() and __iter__()
Properties: all_processes, concurrency, generator_fn, generator_process, input_queue, log_level, output_queue, polling_interval, worker_fn and worker_processes
When you initialize a WorkerPool object you are required to provide values for the concurrency, generator_fn and worker_fn properties. You can set the values of the concurrency, generator_fn, log_level, polling_interval and worker_fn properties by passing keyword arguments to the class initializer.
all_processes
A list with all multiprocessing.Process objects used by the pool.
Note: The all_processes property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
concurrency
The number of processes allowed to run simultaneously (an integer).
Note: The concurrency property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named concurrency (unless a custom constructor is defined, in which case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
generator_fn
A user defined generator to populate input_queue.
Note: The generator_fn property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named generator_fn (unless a custom constructor is defined, in which case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
generator_process
A multiprocessing.Process object to run generator_fn.
Note: The generator_process property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
input_queue
The input queue (a multiprocessing.Queue object).
Note: The input_queue property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
log_level
The logging level to configure in child processes (an integer).
Defaults to the current log level in the parent process at the point when the worker processes are created.
Note: The log_level property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
output_queue
The output queue (a multiprocessing.Queue object).
Note: The output_queue property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
polling_interval
The time to wait between checking output_queue (a floating point number, defaults to 0.1 second).
Note: The polling_interval property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
worker_fn
A user defined worker function to consume input_queue and populate output_queue.
Note: The worker_fn property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named worker_fn (unless a custom constructor is defined, in which case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
worker_processes
A list of multiprocessing.Process objects to run worker_fn.
Note: The worker_processes property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
__iter__()
Initialize the generator and worker processes and start yielding values from the output_queue.
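The arrangement described by these properties (one generator process feeding an input queue, several worker processes moving results to an output queue, and the parent iterating over that output) can be sketched with plain multiprocessing primitives. This is an illustration under stated assumptions, not the WorkerPool implementation: the function names are hypothetical, the "fork" start method is assumed, and completion is signalled with sentinel values instead of the polling that the polling_interval property suggests.

```python
import multiprocessing

# Assumption: the "fork" start method (available on Linux) is used.
context = multiprocessing.get_context("fork")
DONE = None  # sentinel that marks the end of a queue


def run_generator(generator_fn, input_queue, concurrency):
    """Queue each generated item, then one sentinel per worker."""
    try:
        for item in generator_fn():
            # Items are wrapped in a tuple so they can't be mistaken for DONE.
            input_queue.put((item,))
    finally:
        for _ in range(concurrency):
            input_queue.put(DONE)


def run_worker(worker_fn, input_queue, output_queue):
    """Apply worker_fn to queued items until the sentinel is seen."""
    try:
        while True:
            message = input_queue.get()
            if message is DONE:
                break
            output_queue.put((worker_fn(message[0]),))
    finally:
        output_queue.put(DONE)


def iterate_pool(generator_fn, worker_fn, concurrency):
    """Yield worker results (in completion order) as they become available."""
    input_queue = context.Queue()
    output_queue = context.Queue()
    processes = [
        context.Process(
            target=run_generator,
            args=(generator_fn, input_queue, concurrency),
        )
    ]
    processes.extend(
        context.Process(
            target=run_worker,
            args=(worker_fn, input_queue, output_queue),
        )
        for _ in range(concurrency)
    )
    for process in processes:
        process.start()
    finished_workers = 0
    while finished_workers < concurrency:
        message = output_queue.get()
        if message is DONE:
            finished_workers += 1
        else:
            yield message[0]
    for process in processes:
        process.join()


def numbers():
    return range(10)


def square(number):
    return number * number


results = sorted(iterate_pool(numbers, square, concurrency=2))
print(results)
```

Results arrive in completion order rather than input order, which is why the usage example sorts them. The sketch does not recover from a worker function that raises an exception.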
pdiffcopy.operations
Utility functions used by the client as well as the server.
pdiffcopy.operations.get_file_info(filename)
Get information about a local file.
Parameters: filename – An absolute filename (a string).
Returns: A dictionary with file metadata, currently only the file size is included. If the file doesn’t exist an empty dictionary is returned.
pdiffcopy.operations.get_file_size(filename)
Get the size of a local file.
Parameters: filename – An absolute filename (a string).
Returns: The size of the file (an integer) or None when the file doesn’t exist.
pdiffcopy.operations.read_block(filename, offset, size)
Read a block of data from a local file.
Parameters:
- filename – An absolute filename (a string).
- offset – The byte offset where reading starts (an integer).
- size – The number of bytes to read (an integer).
Returns: The read data (a byte string).
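The contracts of get_file_size() and read_block() are simple enough to sketch with standard library calls. The functions below follow the documented parameters and return values but are illustrations written for this purpose, not the package’s code; the name file_size_or_none is hypothetical.

```python
import os
import tempfile


def file_size_or_none(filename):
    """Return the size of a local file in bytes, or None if it doesn't exist."""
    try:
        return os.path.getsize(filename)
    except FileNotFoundError:
        return None


def read_block(filename, offset, size):
    """Read up to SIZE bytes from a local file, starting at byte OFFSET."""
    with open(filename, "rb") as handle:
        handle.seek(offset)
        return handle.read(size)


with tempfile.TemporaryDirectory() as directory:
    filename = os.path.join(directory, "example.bin")
    with open(filename, "wb") as handle:
        handle.write(b"0123456789")
    size = file_size_or_none(filename)
    block = read_block(filename, offset=4, size=3)
    missing = file_size_or_none(os.path.join(directory, "missing.bin"))
print(size, block, missing)
```

Reading near the end of the file returns fewer bytes than requested, which again corresponds to the short final block of a file whose size is not a multiple of the block size.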