API documentation
This documentation is based on the source code of version 1.0.1 of the pdiffcopy package. The following modules are available:
pdiffcopy
Configuration defaults for the pdiffcopy program.
- pdiffcopy.BLOCK_SIZE = 1048576
  The default block size to be used by pdiffcopy (1 MiB).
- pdiffcopy.DEFAULT_CONCURRENCY = 2
  The default concurrency to be used by pdiffcopy (at least two, at most 1/3 of available cores).
- pdiffcopy.DEFAULT_PORT = 8080
  The default port number for the pdiffcopy server (an integer number, defaults to 8080).
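The rule behind DEFAULT_CONCURRENCY ("at least two, at most 1/3 of available cores") can be sketched as follows. This is an illustration only: the helper name default_concurrency is hypothetical, and the use of integer division for the "1/3" part is an assumption rather than something the documentation states.

```python
import multiprocessing


def default_concurrency(cpu_count=None):
    """Hypothetical helper: one third of the cores, but never fewer than two."""
    if cpu_count is None:
        cpu_count = multiprocessing.cpu_count()
    # Assumption: round down when the core count is not a multiple of three.
    return max(2, cpu_count // 3)
```

On machines with fewer than six cores the lower bound of two wins, which matches the value of 2 shown above.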
pdiffcopy.cli
Usage: pdiffcopy [OPTIONS] [SOURCE, TARGET]
Synchronize large binary data files between Linux servers at blazing speeds by performing delta transfers and spreading the work over many CPU cores.
One of the SOURCE and TARGET arguments is expected to be the pathname of a local file and the other argument is expected to be a URL that provides the location of a remote pdiffcopy server and a remote filename. File data will be read from SOURCE and written to TARGET.
If no positional arguments are given the server is started.
Supported options:
Option | Description |
---|---|
-b, --block-size=BYTES | Customize the block size of the delta transfer. Can be a plain integer number (bytes) or an expression like 5K, 1MiB, etc. |
-m, --hash-method=NAME | Customize the hash method of the delta transfer (defaults to ‘sha1’ but supports all hash methods provided by the Python hashlib module). |
-W, --whole-file | Disable the delta transfer algorithm (skips hash computation and downloads all blocks unconditionally). |
-c, --concurrency=COUNT | Change the number of parallel block hash / copy operations. |
-n, --dry-run | Scan for differences between the source and target file and report the similarity index, but don’t write any changed blocks to the target. |
-B, --benchmark=COUNT | Evaluate the effectiveness of delta transfer by mutating the TARGET file (which must be a local file) and resynchronizing its contents. This process is repeated COUNT times, with varying similarity. At the end an overview is printed. |
-l, --listen=ADDRESS | Listen on the specified IP:PORT or PORT. |
-v, --verbose | Increase logging verbosity (can be repeated). |
-q, --quiet | Decrease logging verbosity (can be repeated). |
-h, --help | Show this message and exit. |
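To make the --block-size syntax concrete, here is a minimal sketch of how an expression such as 5K or 1MiB could be turned into a byte count. The function parse_block_size and the UNITS table are hypothetical and not part of pdiffcopy. In particular, treating K/M/G as decimal multiples and KiB/MiB/GiB as binary multiples is an assumption; this page does not say how pdiffcopy interprets the short suffixes.

```python
import re

# Assumed unit table: decimal multiples for K/M/G, binary multiples for KiB/MiB/GiB.
UNITS = {
    "": 1, "B": 1,
    "K": 1000, "KB": 1000,
    "M": 1000 ** 2, "MB": 1000 ** 2,
    "G": 1000 ** 3, "GB": 1000 ** 3,
    "KIB": 1024, "MIB": 1024 ** 2, "GIB": 1024 ** 3,
}


def parse_block_size(text):
    """Hypothetical parser: convert '4096', '5K' or '1MiB' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", text)
    if not match:
        raise ValueError("Invalid block size: %r" % text)
    number, unit = match.groups()
    try:
        return int(number) * UNITS[unit.upper()]
    except KeyError:
        raise ValueError("Unknown unit in block size: %r" % text)
```

Under these assumptions parse_block_size("1MiB") equals the BLOCK_SIZE default of 1048576.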
pdiffcopy.client
Parallel, differential file copy client.
- class pdiffcopy.client.Client(**kw)
  Python API for the client side of the pdiffcopy program.
  Here’s an overview of the Client class:
  Superclass: PropertyManager
  Public methods: compute_transfer_size(), find_changes(), mutate_target(), run_benchmark(), synchronize(), synchronize_once() and transfer_changes()
  Properties: benchmark, block_size, concurrency, delta_transfer, dry_run, hash_method, source and target
  You can set the values of the benchmark, block_size, concurrency, delta_transfer, dry_run, hash_method, source and target properties by passing keyword arguments to the class initializer.
  - benchmark
    How many times the benchmark should be run (an integer, defaults to 0).
    Note: The benchmark property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - block_size
    The block size used by the client.
    Note: The block_size property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - concurrency
    The number of parallel processes that the client is allowed to start.
    Note: The concurrency property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - delta_transfer
    Whether delta transfer is enabled (a boolean, defaults to True).
    Note: The delta_transfer property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - dry_run
    Whether the client is allowed to make changes.
    Note: The dry_run property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - hash_method
    The block hash method (a string, defaults to ‘sha1’).
    Note: The hash_method property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - source
    The Location from which data is read.
    Note: The source property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - target
    The Location to which data is written.
    Note: The target property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - compute_transfer_size(offsets)
    Figure out how much data we’re going to transfer.
    Parameters: offsets – A list of integers with the offsets of the blocks to be synchronized.
    Returns: The amount of data to be transferred in bytes (an integer).
    This would be trivially easy if it weren’t for the last block, which can be smaller than the block size. Depending on the configured block size and the size of the file being synchronized the difference may be negligible or quite significant, so we go to the effort of calculating this correctly.
  - synchronize_once()
    Synchronize from source to target.
    Returns: The number of blocks that differed (an integer).
  - find_changes()
    Helper for synchronize() to compute the similarity index.
  - transfer_changes(offsets)
    Helper for synchronize() to transfer the differences.
    Parameters: offsets – A list of integers with the byte offsets of the blocks to copy from source to target.
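The last-block caveat described for compute_transfer_size() can be illustrated with a small stand-alone function. This is a sketch, not the method's actual implementation: the real method takes only offsets and presumably reads the block size and file size from the client's properties, whereas the block_size and file_size parameters here are assumptions made to keep the example self-contained.

```python
def compute_transfer_size(offsets, block_size, file_size):
    """Sum the sizes of the blocks that start at the given byte offsets."""
    total = 0
    for offset in offsets:
        # Every block is block_size bytes long, except a final block
        # that is cut short by the end of the file.
        total += max(0, min(block_size, file_size - offset))
    return total
```

For a 2500 byte file and a block size of 1024, the block at offset 2048 contributes only 452 bytes, so a naive `len(offsets) * block_size` would overestimate the transfer by 572 bytes.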
- pdiffcopy.client.get_hashes_fn(location, **options)
  Adapter for multiprocessing used by Client.find_changes().
- pdiffcopy.client.transfer_block_fn(offset, source, target, block_size)
  Adapter for multiprocessing used by Client.transfer_changes().
- class pdiffcopy.client.Location(**kw)
  A local or remote file to be copied.
  Here’s an overview of the Location class:
  Superclass: PropertyManager
  Public methods: get_hashes(), get_url(), read_block(), resize() and write_block()
  Properties: exists, expression, file_info, file_size, filename, hostname, label and port_number
  You can set the values of the expression, filename, hostname and port_number properties by passing keyword arguments to the class initializer.
  - exists
    True if the file exists, False otherwise.
    Note: The exists property is a cached_property. This property’s value is computed once (the first time it is accessed) and the result is cached. To clear the cached value you can use del or delattr().
  - expression
    The location expression (a string).
    Note: The expression property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - filename
    The absolute pathname of the file to copy (a string).
    Note: The filename property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - hostname
    The host name of a pdiffcopy server (a string or None).
    Note: The hostname property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - label
    A human friendly label for the location (a string).
  - port_number
    The port number of a pdiffcopy server (a number or None).
    Note: The port_number property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - file_info
    A dictionary with file metadata.
    Note: The file_info property is a cached_property. This property’s value is computed once (the first time it is accessed) and the result is cached. To clear the cached value you can use del or delattr().
  - file_size
    The size of the file in bytes (an integer).
    Note: The file_size property is a cached_property. This property’s value is computed once (the first time it is accessed) and the result is cached. To clear the cached value you can use del or delattr().
  - get_hashes(**options)
    Get the hashes of the blocks in a file.
    Parameters: options – See get_url().
    Returns: A generator of tuples with two values each:
    - A byte offset into the file (an integer).
    - The hash of the block starting at that offset (a string).
  - get_url(endpoint, **params)
    Get the server URL for the given endpoint.
    Parameters:
    - endpoint – The name of a server side endpoint (a string).
    - params – Any query string parameters.
  - read_block(offset, size)
    Read a block of data from filename.
    Parameters:
    - offset – The byte offset where reading starts (an integer).
    - size – The number of bytes to read (an integer).
    Returns: A byte string.
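The (offset, hash) pairs that Location.get_hashes() is documented to yield are the basis of the delta transfer. The sketch below shows one way to produce such pairs for a local file with hashlib and to compare two sets of them. The function names get_block_hashes and find_changed_offsets are hypothetical, and the comparison logic is an assumption about how differing blocks could be detected, not a copy of what Client.find_changes() does.

```python
import hashlib


def get_block_hashes(filename, block_size=1024 * 1024, method="sha1"):
    """Yield (offset, hexdigest) pairs for consecutive blocks of a local file."""
    offset = 0
    with open(filename, "rb") as handle:
        while True:
            data = handle.read(block_size)
            if not data:
                break
            yield offset, hashlib.new(method, data).hexdigest()
            offset += len(data)


def find_changed_offsets(source_hashes, target_hashes):
    """Return the offsets of source blocks that are missing or different in the target."""
    target = dict(target_hashes)
    return [offset for offset, digest in source_hashes if target.get(offset) != digest]
```

Only the blocks at the returned offsets would need to be copied; with --whole-file this comparison is skipped and every block is transferred.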
pdiffcopy.exceptions
Custom exceptions raised by the pdiffcopy modules.
- exception pdiffcopy.exceptions.ProgramError(text, *args, **kw)
  The base exception class for all custom exceptions raised by the pdiffcopy modules.
  - __init__(text, *args, **kw)
    Initialize a ProgramError object.
    For argument handling see the compact() function. The resulting string is used as the exception message.
pdiffcopy.hashing
Parallel hashing of files using multiprocessing and pdiffcopy.mp.
pdiffcopy.mp
Adaptations of multiprocessing that make it easier to do the right thing.
This module stands alone as a library used by the other modules that are specialized to what pdiffcopy does (synchronizing files). I may end up extracting this to a separate package at some point, because over the 10+ years that I’ve been programming Python I’ve written an awful lot of plumbing code for multiprocessing and it’s not exactly my favorite thing in the world (I suck at reasoning about concurrency, like most people I guess).
- class pdiffcopy.mp.Promise(**options)
  Execute a Python function in a child process and retrieve its return value.
  - __init__(**options)
    Initialize a Promise object.
    The initializer arguments are the same as for multiprocessing.Process. The child process is started automatically.
- class pdiffcopy.mp.WorkerPool(**kw)
  Simple to use worker pool implementation using multiprocessing.
  Here’s an overview of the WorkerPool class:
  Superclass: PropertyManager
  Special methods: __enter__(), __exit__() and __iter__()
  Properties: all_processes, concurrency, generator_fn, generator_process, input_queue, log_level, output_queue, polling_interval, worker_fn and worker_processes
  When you initialize a WorkerPool object you are required to provide values for the concurrency, generator_fn and worker_fn properties. You can set the values of the concurrency, generator_fn, log_level, polling_interval and worker_fn properties by passing keyword arguments to the class initializer.
  - all_processes
    A list with all multiprocessing.Process objects used by the pool.
    Note: The all_processes property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
  - concurrency
    The number of processes allowed to run simultaneously (an integer).
    Note: The concurrency property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named concurrency (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
  - generator_fn
    A user defined generator to populate input_queue.
    Note: The generator_fn property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named generator_fn (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
  - generator_process
    A multiprocessing.Process object to run generator_fn.
    Note: The generator_process property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
  - input_queue
    The input queue (a multiprocessing.Queue object).
    Note: The input_queue property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
  - log_level
    The logging level to configure in child processes (an integer).
    Defaults to the current log level in the parent process at the point when the worker processes are created.
    Note: The log_level property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - output_queue
    The output queue (a multiprocessing.Queue object).
    Note: The output_queue property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
  - polling_interval
    The time to wait between checking output_queue (a floating point number, defaults to 0.1 second).
    Note: The polling_interval property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
  - worker_fn
    A user defined worker function to consume input_queue and populate output_queue.
    Note: The worker_fn property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named worker_fn (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
  - worker_processes
    A list of multiprocessing.Process objects to run worker_fn.
    Note: The worker_processes property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
  - __iter__()
    Initialize the generator and worker processes and start yielding values from the output_queue.
pdiffcopy.operations
Utility functions used by the client as well as the server.
- pdiffcopy.operations.get_file_info(filename)
  Get information about a local file.
  Parameters: filename – An absolute filename (a string).
  Returns: A dictionary with file metadata, currently only the file size is included. If the file doesn’t exist an empty dictionary is returned.
- pdiffcopy.operations.get_file_size(filename)
  Get the size of a local file.
  Parameters: filename – An absolute filename (a string).
  Returns: The size of the file (an integer) or None when the file doesn’t exist.
- pdiffcopy.operations.read_block(filename, offset, size)
  Read a block of data from a local file.
  Parameters:
  - filename – An absolute filename (a string).
  - offset – The byte offset where reading starts (an integer).
  - size – The number of bytes to read (an integer).
  Returns: The read data (a byte string).
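The documented contracts of get_file_size(), get_file_info() and read_block() are small enough to sketch with the standard library alone. The following reimplements them for illustration and should not be read as the package's source code. In particular, the dictionary key "size" is an assumption, because the documentation only says that the file size is included.

```python
import os


def get_file_size(filename):
    """Return the size of a local file in bytes, or None when it doesn't exist."""
    try:
        return os.path.getsize(filename)
    except FileNotFoundError:
        return None


def get_file_info(filename):
    """Return a dictionary with file metadata, empty when the file doesn't exist."""
    size = get_file_size(filename)
    # Assumed key name: the documentation does not specify it.
    return {} if size is None else {"size": size}


def read_block(filename, offset, size):
    """Read up to `size` bytes from a local file, starting at byte `offset`."""
    with open(filename, "rb") as handle:
        handle.seek(offset)
        return handle.read(size)
```

Note that read_block() in this sketch returns fewer than `size` bytes when the block runs past the end of the file, which is the situation compute_transfer_size() has to account for.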