API documentation

This documentation is based on the source code of version 1.0.1 of the pdiffcopy package. The following modules are available:

pdiffcopy

Configuration defaults for the pdiffcopy program.

pdiffcopy.BLOCK_SIZE = 1048576

The default block size to be used by pdiffcopy (1 MiB).

pdiffcopy.DEFAULT_CONCURRENCY = 2

The default concurrency to be used by pdiffcopy (at least two, at most 1/3 of available cores).
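The two bounds in this description can be read as a clamp on the number of cores. The following is a minimal sketch, assuming the value is derived from `multiprocessing.cpu_count()` with integer division and that the lower bound wins on small machines; the function name `default_concurrency` is hypothetical and the exact formula used by pdiffcopy may differ.

```python
import multiprocessing


def default_concurrency(cores=None):
    """Return one third of the available cores, but never less than two."""
    if cores is None:
        cores = multiprocessing.cpu_count()
    # Assumption: integer division, with the lower bound taking precedence.
    return max(2, cores // 3)


print(default_concurrency(4))   # 2 (the lower bound applies)
print(default_concurrency(24))  # 8 (one third of the cores)
```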

pdiffcopy.DEFAULT_PORT = 8080

The default port number for the pdiffcopy server (an integer number, defaults to 8080).

pdiffcopy.cli

Usage: pdiffcopy [OPTIONS] [SOURCE, TARGET]

Synchronize large binary data files between Linux servers at blazing speeds by performing delta transfers and spreading the work over many CPU cores.

One of the SOURCE and TARGET arguments is expected to be the pathname of a local file and the other argument is expected to be a URL that provides the location of a remote pdiffcopy server and a remote filename. File data will be read from SOURCE and written to TARGET.

If no positional arguments are given the server is started.

Supported options:

Option                    Description
-b, --block-size=BYTES    Customize the block size of the delta transfer. Can be a plain integer number (bytes) or an expression like 5K, 1MiB, etc.
-m, --hash-method=NAME    Customize the hash method of the delta transfer (defaults to ‘sha1’ but supports all hash methods provided by the Python hashlib module).
-W, --whole-file          Disable the delta transfer algorithm (skips computing hashes and downloads all blocks unconditionally).
-c, --concurrency=COUNT   Change the number of parallel block hash / copy operations.
-n, --dry-run             Scan for differences between the source and target file and report the similarity index, but don’t write any changed blocks to the target.
-B, --benchmark=COUNT     Evaluate the effectiveness of delta transfer by mutating the TARGET file (which must be a local file) and resynchronizing its contents. This process is repeated COUNT times, with varying similarity. At the end an overview is printed.
-l, --listen=ADDRESS      Listen on the specified IP:PORT or PORT.
-v, --verbose             Increase logging verbosity (can be repeated).
-q, --quiet               Decrease logging verbosity (can be repeated).
-h, --help                Show this message and exit.

pdiffcopy.cli.main()

The command line interface.

pdiffcopy.cli.run_client(**options)

Run the client program.

pdiffcopy.cli.run_server(**options)

Run the server program.

pdiffcopy.client

Parallel, differential file copy client.

class pdiffcopy.client.Client(**kw)

Python API for the client side of the pdiffcopy program.

Here’s an overview of the Client class:

Superclass: PropertyManager
Public methods: compute_transfer_size(), find_changes(), mutate_target(), run_benchmark(), synchronize(), synchronize_once() and transfer_changes()
Properties: benchmark, block_size, concurrency, delta_transfer, dry_run, hash_method, source and target

You can set the values of the benchmark, block_size, concurrency, delta_transfer, dry_run, hash_method, source and target properties by passing keyword arguments to the class initializer.

benchmark

How many times the benchmark should be run (an integer, defaults to 0).

Note

The benchmark property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

block_size

The block size used by the client.

Note

The block_size property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

concurrency

The number of parallel processes that the client is allowed to start.

Note

The concurrency property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

delta_transfer

Whether delta transfer is enabled (a boolean, defaults to True).

Note

The delta_transfer property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

dry_run

Whether the client is allowed to make changes.

Note

The dry_run property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

hash_method

The block hash method (a string, defaults to ‘sha1’).

Note

The hash_method property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

source

The Location from which data is read.

Note

The source property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

target

The Location to which data is written.

Note

The target property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

compute_transfer_size(offsets)

Figure out how much data we’re going to transfer.

Parameters: offsets – A list of integers with the offsets of the blocks to be synchronized.
Returns: The amount of data to be transferred in bytes (an integer).

This would be trivially easy if it weren’t for the last block, which can be smaller than the block size. Depending on the configured block size and the size of the file being synchronized, the difference may be negligible or quite significant, so we go to the effort of calculating this correctly.
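The calculation described above can be sketched as a sum in which every block is clipped to the end of the file. This is an illustration only: the documented method takes just `offsets`, so passing `file_size` and `block_size` as explicit arguments is an assumption made to keep the example standalone.

```python
def compute_transfer_size(offsets, file_size, block_size):
    """Sum the sizes of the blocks at the given offsets, clipping the last block."""
    return sum(min(block_size, file_size - offset) for offset in offsets)


# A 2.5 MiB file split into 1 MiB blocks: the block at offset 2 MiB holds only 0.5 MiB.
MIB = 1024 * 1024
file_size = 2 * MIB + MIB // 2
print(compute_transfer_size([0, 2 * MIB], file_size, MIB))  # 1572864
```

Multiplying the number of offsets by the block size would report 2 MiB here, which overstates the transfer by a third.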

mutate_target(percentage)

Invalidate a percentage of the data in the target file.
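How the blocks to invalidate are chosen is not documented here. As a rough sketch, assuming whole blocks are picked at random and overwritten with random bytes, the idea could look like this (the helper `mutate_blocks` and its in-memory `bytearray` argument are hypothetical):

```python
import os
import random


def mutate_blocks(data, percentage, block_size, seed=None):
    """Overwrite about `percentage` percent of the blocks in a bytearray."""
    rng = random.Random(seed)
    offsets = list(range(0, len(data), block_size))
    count = round(len(offsets) * percentage / 100)
    chosen = sorted(rng.sample(offsets, count))
    for offset in chosen:
        size = min(block_size, len(data) - offset)
        data[offset:offset + size] = os.urandom(size)  # replace the block in place
    return chosen


data = bytearray(40)                       # ten blocks of four bytes
chosen = mutate_blocks(data, 30, 4, seed=1)
print(len(chosen))  # 3
```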

run_benchmark()

Benchmark the effectiveness of the delta transfer implementation.

synchronize()

Synchronize from source to target (possibly more than once, see benchmark).

synchronize_once()

Synchronize from source to target.

Returns: The number of blocks that differed (an integer).

find_changes()

Helper for synchronize() to compute the similarity index.
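The underlying idea of a delta transfer is to hash both files block by block and to copy only the blocks whose hashes differ. Below is a minimal in-memory sketch of that comparison; the helper names and the definition of the similarity index as the fraction of matching blocks are assumptions, not the actual implementation of find_changes().

```python
import hashlib


def block_hashes(data, block_size, method="sha1"):
    """Map each block offset to the hex digest of the block starting there."""
    return {
        offset: hashlib.new(method, data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    }


def find_changed_offsets(source, target, block_size):
    """Return the offsets whose hashes differ, plus the fraction of matching blocks."""
    source_hashes = block_hashes(source, block_size)
    target_hashes = block_hashes(target, block_size)
    changed = sorted(
        offset for offset, digest in source_hashes.items()
        if target_hashes.get(offset) != digest
    )
    similarity = 1.0 - len(changed) / len(source_hashes) if source_hashes else 1.0
    return changed, similarity


source = b"aaaabbbbccccdddd"
target = b"aaaaXXXXccccdddd"
print(find_changed_offsets(source, target, 4))  # ([4], 0.75)
```

Only the block at offset 4 would need to be transferred in this example.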

transfer_changes(offsets)

Helper for synchronize() to transfer the differences.

Parameters: offsets – A list of integers with the byte offsets of the blocks to copy from source to target.

pdiffcopy.client.get_hashes_fn(location, **options)

Adapter for multiprocessing used by Client.find_changes().

pdiffcopy.client.transfer_block_fn(offset, source, target, block_size)

Adapter for multiprocessing used by Client.transfer_changes().

class pdiffcopy.client.Location(**kw)

A local or remote file to be copied.

Here’s an overview of the Location class:

Superclass: PropertyManager
Public methods: get_hashes(), get_url(), read_block(), resize() and write_block()
Properties: exists, expression, file_info, file_size, filename, hostname, label and port_number

You can set the values of the expression, filename, hostname and port_number properties by passing keyword arguments to the class initializer.

exists

True if the file exists, False otherwise.

Note

The exists property is a cached_property. This property’s value is computed once (the first time it is accessed) and the result is cached. To clear the cached value you can use del or delattr().
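The caching behaviour described in this note (compute once, clear with del) can be demonstrated with the standard library's `functools.cached_property`, which behaves similarly in this respect. The class `FileProbe` below is a hypothetical stand-in for Location, not the class documented here.

```python
import functools
import os
import tempfile


class FileProbe:
    """Hypothetical stand-in for Location, using functools.cached_property."""

    def __init__(self, filename):
        self.filename = filename

    @functools.cached_property
    def exists(self):
        return os.path.isfile(self.filename)


with tempfile.TemporaryDirectory() as directory:
    probe = FileProbe(os.path.join(directory, "data.bin"))
    first = probe.exists               # computed now: the file is missing
    open(probe.filename, "wb").close()
    stale = probe.exists               # still the cached answer
    del probe.exists                   # clear the cached value
    fresh = probe.exists               # recomputed: the file exists now

print(first, stale, fresh)  # False False True
```

The same pattern applies to the other cached properties: after changing a file on disk, the cached value has to be cleared before the new state becomes visible.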

expression

The location expression (a string).

Note

The expression property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

filename

The absolute pathname of the file to copy (a string).

Note

The filename property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

hostname

The host name of a pdiffcopy server (a string or None).

Note

The hostname property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

label

A human friendly label for the location (a string).

port_number

The port number of a pdiffcopy server (a number or None).

Note

The port_number property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

file_info

A dictionary with file metadata.

Note

The file_info property is a cached_property. This property’s value is computed once (the first time it is accessed) and the result is cached. To clear the cached value you can use del or delattr().

file_size

The size of the file in bytes (an integer).

Note

The file_size property is a cached_property. This property’s value is computed once (the first time it is accessed) and the result is cached. To clear the cached value you can use del or delattr().

get_hashes(**options)

Get the hashes of the blocks in a file.

Parameters: options – See get_url().
Returns: A generator of tuples with two values each:
  1. A byte offset into the file (an integer).
  2. The hash of the block starting at that offset (a string).

get_url(endpoint, **params)

Get the server URL for the given endpoint.

Parameters:
  • endpoint – The name of a server side endpoint (a string).
  • params – Any query string parameters.

read_block(offset, size)

Read a block of data from filename.

Parameters:
  • offset – The byte offset where reading starts (an integer).
  • size – The number of bytes to read (an integer).
Returns: A byte string.

resize(size)

Adjust the size of filename to the given size.

Parameters: size – The new file size in bytes (an integer).

write_block(offset, data)

Write a block of data to filename.

Parameters:
  • offset – The byte offset where writing starts (an integer).
  • data – The byte string to write to the file.

pdiffcopy.exceptions

Custom exceptions raised by the pdiffcopy modules.

exception pdiffcopy.exceptions.ProgramError(text, *args, **kw)

The base exception class for all custom exceptions raised by the pdiffcopy modules.

__init__(text, *args, **kw)

Initialize a ProgramError object.

For argument handling see the compact() function. The resulting string is used as the exception message.
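To illustrate how such an exception hierarchy is typically used, here is a sketch with hypothetical stand-in classes. The local `compact()` helper is an assumption: it is taken to interpolate the arguments into the text and collapse runs of whitespace, which may not match the referenced function in every detail.

```python
def compact(text, *args, **kw):
    """Hypothetical stand-in: interpolate the arguments, then collapse whitespace."""
    if args:
        text = text % args
    if kw:
        text = text.format(**kw)
    return " ".join(text.split())


class ProgramErrorSketch(Exception):
    """Stand-in base class whose message is built by compact()."""

    def __init__(self, text, *args, **kw):
        super().__init__(compact(text, *args, **kw))


class DependencyErrorSketch(ProgramErrorSketch):
    """Stand-in for a more specific error."""


try:
    raise DependencyErrorSketch("""
        The %s package
        is not installed!
    """, "example")
except ProgramErrorSketch as error:  # catching the base class covers all subclasses
    print(error)  # The example package is not installed!
```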

exception pdiffcopy.exceptions.BenchmarkAbortedError(text, *args, **kw)

Raised when the operator doesn’t give explicit permission to run the benchmark.

exception pdiffcopy.exceptions.DependencyError(text, *args, **kw)

Raised when client or server installation requirements are missing.

pdiffcopy.hashing

Parallel hashing of files using multiprocessing and pdiffcopy.mp.

pdiffcopy.hashing.compute_hashes(filename, block_size, method, concurrency)

Compute checksums of a file in blocks (parallel).
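The sketch below shows the general shape of hashing a file in blocks with several workers. Note the swapped-in technique: pdiffcopy uses multiprocessing, whereas this example uses a thread pool from `concurrent.futures` so that it runs anywhere without process start-up concerns. The function names are hypothetical.

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def hash_block(filename, offset, block_size, method):
    """Hash a single block; every call opens its own file handle."""
    with open(filename, "rb") as handle:
        handle.seek(offset)
        return offset, hashlib.new(method, handle.read(block_size)).hexdigest()


def compute_hashes_sketch(filename, block_size, method="sha1", concurrency=2):
    """Yield (offset, hexdigest) pairs for each block, in file order."""
    offsets = range(0, os.path.getsize(filename), block_size)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for result in pool.map(lambda o: hash_block(filename, o, block_size, method), offsets):
            yield result


with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"a" * 10)
try:
    hashes = list(compute_hashes_sketch(handle.name, block_size=4))
finally:
    os.unlink(handle.name)
print([offset for offset, _ in hashes])  # [0, 4, 8]
```

The last block (offset 8) covers only two bytes, so its digest is that of a shorter block.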

pdiffcopy.hashing.hash_worker(offset, block_size, filename, method)

Worker function to be run in child processes.

pdiffcopy.mp

Adaptations of multiprocessing that make it easier to do the right thing.

This module stands alone as a library used by the other modules that are specialized to what pdiffcopy does (synchronizing files). I may end up extracting this to a separate package at some point, because over the 10+ years that I’ve been programming Python I’ve written an awful lot of plumbing code for multiprocessing and it’s not exactly my favorite thing in the world (I suck at reasoning about concurrency, like most people I guess).

class pdiffcopy.mp.Promise(**options)

Execute a Python function in a child process and retrieve its return value.

__init__(**options)

Initialize a Promise object.

The initializer arguments are the same as for multiprocessing.Process. The child process is started automatically.

run()

Run the target function in a newly spawned child process.

join()

Get the return value and wait for the child process to finish.
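The pattern of running a function in a child process and fetching its return value can be sketched with a pipe. The class `PromiseSketch` is a hypothetical stand-in, not the Promise implementation; it assumes the "fork" start method (POSIX only) and does no error handling, so an exception in the child would surface as an EOFError in the parent.

```python
import multiprocessing


def _call_and_send(connection, function, args):
    """Run in the child: call the function and send the result to the parent."""
    connection.send(function(*args))
    connection.close()


class PromiseSketch:
    """Start a child process immediately and collect its result in join()."""

    def __init__(self, target, args=()):
        # Assumption: the "fork" start method, so that the target function
        # does not have to be importable by the child process.
        context = multiprocessing.get_context("fork")
        self.receiver, sender = context.Pipe(duplex=False)
        self.process = context.Process(target=_call_and_send, args=(sender, target, args))
        self.process.start()
        sender.close()  # the parent keeps only the receiving end

    def join(self):
        result = self.receiver.recv()  # read first so a large result cannot block the child
        self.process.join()
        return result


def square(number):
    return number * number


promise = PromiseSketch(target=square, args=(7,))
print(promise.join())  # 49
```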

class pdiffcopy.mp.WorkerPool(**kw)

Simple to use worker pool implementation using multiprocessing.

Here’s an overview of the WorkerPool class:

Superclass: PropertyManager
Special methods: __enter__(), __exit__() and __iter__()
Properties: all_processes, concurrency, generator_fn, generator_process, input_queue, log_level, output_queue, polling_interval, worker_fn and worker_processes

When you initialize a WorkerPool object you are required to provide values for the concurrency, generator_fn and worker_fn properties. You can set the values of the concurrency, generator_fn, log_level, polling_interval and worker_fn properties by passing keyword arguments to the class initializer.

all_processes

A list with all multiprocessing.Process objects used by the pool.

Note

The all_processes property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.

concurrency

The number of processes allowed to run simultaneously (an integer).

Note

The concurrency property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named concurrency (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.

generator_fn

A user defined generator to populate input_queue.

Note

The generator_fn property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named generator_fn (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.

generator_process

A multiprocessing.Process object to run generator_fn.

Note

The generator_process property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.

input_queue

The input queue (a multiprocessing.Queue object).

Note

The input_queue property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.

log_level

The logging level to configure in child processes (an integer).

Defaults to the current log level in the parent process at the point when the worker processes are created.

Note

The log_level property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

output_queue

The output queue (a multiprocessing.Queue object).

Note

The output_queue property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.

polling_interval

The time to wait between checking output_queue (a floating point number, defaults to 0.1 seconds).

Note

The polling_interval property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().

worker_fn

A user defined worker function to consume input_queue and populate output_queue.

Note

The worker_fn property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named worker_fn (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.

worker_processes

A list of multiprocessing.Process objects to run worker_fn.

Note

The worker_processes property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.

__iter__()

Initialize the generator and worker processes and start yielding values from the output_queue.

__enter__()

Start up the generator and worker processes.

__exit__(exc_type=None, exc_value=None, traceback=None)

Terminate any child processes that are still alive.
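The data flow of the pool (a generator fills the input queue, workers consume it and fill the output queue, the caller iterates over the results) can be sketched as follows. This is a simplified, thread-based stand-in using `queue.Queue`, not the multiprocessing implementation; the class name, the sentinel object and the choice to start the threads in `__enter__()` are all assumptions.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of a queue


class WorkerPoolSketch:
    """Hypothetical thread-based stand-in for the generator/worker pipeline."""

    def __init__(self, concurrency, generator_fn, worker_fn):
        self.concurrency = concurrency
        self.generator_fn = generator_fn
        self.worker_fn = worker_fn
        self.input_queue = queue.Queue()
        self.output_queue = queue.Queue()

    def _generate(self):
        for item in self.generator_fn():
            self.input_queue.put(item)
        for _ in range(self.concurrency):
            self.input_queue.put(_DONE)  # one sentinel per worker

    def _work(self):
        try:
            for item in iter(self.input_queue.get, _DONE):
                self.output_queue.put(self.worker_fn(item))
        finally:
            self.output_queue.put(_DONE)  # always report that this worker stopped

    def __enter__(self):
        targets = [self._generate] + [self._work] * self.concurrency
        self.threads = [threading.Thread(target=t, daemon=True) for t in targets]
        for thread in self.threads:
            thread.start()
        return self

    def __exit__(self, exc_type=None, exc_value=None, traceback=None):
        for thread in self.threads:
            thread.join()

    def __iter__(self):
        finished = 0
        while finished < self.concurrency:
            value = self.output_queue.get()
            if value is _DONE:
                finished += 1
            else:
                yield value


with WorkerPoolSketch(concurrency=3, generator_fn=lambda: range(5), worker_fn=lambda n: n * n) as pool:
    print(sorted(pool))  # [0, 1, 4, 9, 16]
```

Results arrive in completion order rather than input order, which is why the example sorts them before printing.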

pdiffcopy.mp.generator_adapter(concurrency, generator_fn, input_queue, log_level)

Adapter function for the generator process.

pdiffcopy.mp.worker_adapter(input_queue, log_level, output_queue, worker_fn)

Adapter function for the worker processes.

pdiffcopy.operations

Utility functions used by the client as well as the server.

pdiffcopy.operations.get_file_info(filename)

Get information about a local file.

Parameters: filename – An absolute filename (a string).
Returns: A dictionary with file metadata, currently only the file size is included. If the file doesn’t exist an empty dictionary is returned.

pdiffcopy.operations.get_file_size(filename)

Get the size of a local file.

Parameters: filename – An absolute filename (a string).
Returns: The size of the file (an integer) or None when the file doesn’t exist.

pdiffcopy.operations.read_block(filename, offset, size)

Read a block of data from a local file.

Parameters:
  • filename – An absolute filename (a string).
  • offset – The byte offset where reading starts (an integer).
  • size – The number of bytes to read (an integer).
Returns: The read data (a byte string).

pdiffcopy.operations.resize_file(filename, size)

Create or resize a local file, in preparation for synchronizing its contents.

Parameters:
  • filename – An absolute filename (a string).
  • size – The new size in bytes (an integer).

pdiffcopy.operations.write_block(filename, offset, data)

Write a block of data to a local file.

Parameters:
  • filename – An absolute filename (a string).
  • offset – The byte offset where writing starts (an integer).
  • data – The data to write (a byte string).
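
The three block operations documented in this module fit together as follows: the target is first sized, then individual blocks are written at their offsets. The functions below are stand-ins written with plain file I/O to show the idea; they share the documented names and parameters but are not the pdiffcopy implementation.

```python
import os
import tempfile


def resize_file(filename, size):
    """Create the file if it is missing, then shrink or zero-extend it to `size` bytes."""
    with open(filename, "ab"):
        pass  # append mode creates the file without touching existing data
    os.truncate(filename, size)


def write_block(filename, offset, data):
    """Overwrite part of an existing file without truncating the rest."""
    with open(filename, "r+b") as handle:
        handle.seek(offset)
        handle.write(data)


def read_block(filename, offset, size):
    """Return up to `size` bytes starting at `offset`."""
    with open(filename, "rb") as handle:
        handle.seek(offset)
        return handle.read(size)


with tempfile.TemporaryDirectory() as directory:
    filename = os.path.join(directory, "target.bin")
    resize_file(filename, 8)             # eight zero bytes
    write_block(filename, 4, b"data")    # overwrite the second half
    print(read_block(filename, 0, 8))    # b'\x00\x00\x00\x00data'
```

Opening the file in "r+b" mode matters in write_block(): the more common "wb" mode would truncate the file and discard the blocks that were already in sync.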