2.5. Files

Files can be used to store an immutable opaque sequence of bytes.

You can obtain a handle to a new or existing file object with new_dxfile() or open_dxfile(), respectively. Both return a remote file handler, which is a Python file-like object. There are also helper functions (download_dxfile(), upload_local_file(), and upload_string()) for directly downloading and uploading existing files or strings in a single operation.

Files are tristate objects:

  • When initially created, a file is in the “open” state and data can be written to it. After you have written your data to the file, call the close() method.

  • The file enters the “closing” state while it is finalized in the platform.

  • Some time later, the file enters the “closed” state and can be read.

Many methods that return a DXFile object take a mode parameter. In general the available modes are as follows (some methods create a new file and consequently do not support immediate reading from it with the “r” mode):

  • “r” (read): for read-only access. No data is expected to be written to the file.

  • “w” (write): for writing. When the object exits scope, all buffers are flushed and closing commences.

  • “a” (append): for writing. When the object exits scope, all buffers are flushed but the file is left open.

Note

The automatic flush and close operations implied by the “w” or “a” modes only happen if the DXFile object is used in a Python context-managed scope (see the following examples).

Here is an example of writing to a file object via a context-managed file handle:

# Open a file for writing
with open_dxfile('file-xxxx', mode='w') as fd:
    for line in input_file:
        fd.write(line)

The use of the context-managed file is optional for read-only objects; that is, you may use the object without a “with” block (and omit the mode parameter), for example:

# Open a file for reading
fd = open_dxfile('file-xxxx')
for line in fd:
    print(line)

Warning

If you write any data to a file and you choose to use a non context-managed file handle, you must call flush() or close() when you are done, for example:

# Open a file for writing; we will flush it explicitly ourselves
fd = open_dxfile('file-xxxx')
for line in input_file:
    fd.write(line)
fd.flush()

If you do not do so, and there is still unflushed data when the DXFile object is garbage collected, the DXFile will attempt to flush it then, in the destructor. However, any errors in the resulting API calls (or, in general, any exception in a destructor) are not propagated back to your program! That is, your writes can silently fail if you rely on the destructor to flush your data.

DXFile will print a warning if it detects unflushed data as the destructor is running (but again, it will attempt to flush it anyway).

Note

Writing to a file with the “w” mode calls close() but does not wait for the file to finish closing. If the file you are writing is one of the outputs of your app or applet, you can use job-based object references, which will make downstream jobs wait for closing to finish before they can begin. However, if you intend to subsequently read from the file in the same process, you will need to call wait_on_close() to ensure the file is ready to be read.
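The write-then-read case in the note above can be sketched as a small helper. This is a minimal sketch, assuming fd is a DXFile handle that is still in the "open" state (for example, one returned by new_dxfile()). The helper name finish_writing and the 600-second timeout are illustrative choices, not part of dxpy.

```python
def finish_writing(fd, timeout=600):
    """Close a written file and block until the platform reports it closed.

    fd is assumed to be a DXFile-like handle that is still open for writing.
    """
    fd.close()                         # starts closing; does not wait by default
    fd.wait_on_close(timeout=timeout)  # blocks; dxpy raises DXFileError on timeout
    return fd.closed()                 # True once the file is in the "closed" state
```

With valid credentials, a call might look like fd = dxpy.new_dxfile(); fd.write("foo\n"); finish_writing(fd). Passing block=True to close() is documented to have a similar blocking effect.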

2.5.1. Helper Functions

The following helper functions are useful shortcuts for interacting with File objects.

dxpy.bindings.dxfile_functions.open_dxfile(dxid, project=None, mode=None, read_buffer_size=16777216)
Parameters:

dxid (string) – file ID

Return type:

DXFile

Given the object ID of an uploaded file, returns a remote file handler that is a Python file-like object.

Example:

with open_dxfile("file-xxxx") as fd:
    for line in fd:
        ...

Note that this is shorthand for:

DXFile(dxid)

dxpy.bindings.dxfile_functions.new_dxfile(mode=None, write_buffer_size=16777216, expected_file_size=None, file_is_mmapd=False, **kwargs)
Parameters:

mode (string) – One of “w” or “a” for write and append modes, respectively

Return type:

DXFile

Additional optional parameters not listed: all those under dxpy.bindings.DXDataObject.new().

Creates a new remote file object that is ready to be written to; returns a DXFile object that is a writable file-like object.

Example:

with new_dxfile(media_type="application/json") as fd:
    fd.write("foo\n")

Note that this is shorthand for:

dxFile = DXFile()
dxFile.new(**kwargs)

dxpy.bindings.dxfile_functions.download_dxfile(dxid, filename, chunksize=16777216, append=False, show_progress=False, project=None, describe_output=None, symlink_max_tries=15, **kwargs)
Parameters:
  • dxid (string or DXFile) – DNAnexus file ID or DXFile (file handler) object

  • filename (string) – Local filename

  • append (boolean) – If True, appends to the local file (default is to truncate local file if it exists)

  • project (str or None) – project to use as context for this download (may affect which billing account is billed for this download). If None or DXFile.NO_PROJECT_HINT, no project hint is supplied to the API server.

  • describe_output (dict or None) – (experimental) output of the file-xxxx/describe API call, if available. It will make it possible to skip another describe API call. It should contain the default fields of the describe API call output and the “parts” field, not included in the output by default.

  • symlink_max_tries (int or None) – Maximum number of tries when downloading a symlink with aria2c.

Downloads the remote file referenced by dxid and saves it to filename.

Example:

download_dxfile("file-xxxx", "localfilename.fastq")

dxpy.bindings.dxfile_functions.upload_local_file(filename=None, file=None, media_type=None, keep_open=False, wait_on_close=False, use_existing_dxfile=None, show_progress=False, write_buffer_size=None, multithread=True, **kwargs)
Parameters:
  • filename (string) – Local filename

  • file (File-like object) – File-like object

  • media_type (string) – Internet Media Type

  • keep_open (boolean) – If False, closes the file after uploading

  • write_buffer_size (int) – Buffer size to use for upload

  • wait_on_close (boolean) – If True, waits for the file to close

  • use_existing_dxfile (DXFile) – Instead of creating a new file object, upload to the specified file

  • multithread (boolean) – If True, sends multiple write requests asynchronously

Returns:

Remote file handler

Return type:

DXFile

Additional optional parameters not listed: all those under dxpy.bindings.DXDataObject.new().

Exactly one of filename or file is required.

Uploads filename or reads from file into a new file object (with media type media_type if given) and returns the associated remote file handler. The “name” property of the newly created remote file is set to the basename of filename or to file.name (if it exists).

Examples:

# Upload from a path
dxpy.upload_local_file("/home/ubuntu/reads.fastq.gz")
# Upload from a file-like object
with open("reads.fastq") as fh:
    dxpy.upload_local_file(file=fh)

dxpy.bindings.dxfile_functions.upload_string(to_upload, media_type=None, keep_open=False, wait_on_close=False, **kwargs)
Parameters:
  • to_upload (string) – String to upload into a file

  • media_type (string) – Internet Media Type

  • keep_open (boolean) – If False, closes the file after uploading

  • wait_on_close (boolean) – If True, waits for the file to close

Returns:

Remote file handler

Return type:

DXFile

Additional optional parameters not listed: all those under dxpy.bindings.DXDataObject.new().

Uploads the data in the string to_upload into a new file object (with media type media_type if given) and returns the associated remote file handler.
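As an illustration of upload_string(), the following sketch serializes a Python object to JSON and uploads it in one call. The helper name upload_json and its injectable upload parameter are hypothetical. The parameter exists only so the helper can be exercised without network access, and it is assumed to resolve to dxpy.upload_string by default.

```python
import json


def upload_json(obj, upload=None):
    """Serialize obj to JSON and upload it as a new file, waiting for it to close.

    upload is a stand-in for dxpy.upload_string and can be replaced in tests.
    """
    if upload is None:
        import dxpy  # imported lazily; requires dxpy and valid credentials
        upload = dxpy.upload_string
    payload = json.dumps(obj, sort_keys=True)
    # wait_on_close=True makes the call wait until the file has closed
    return upload(payload, media_type="application/json", wait_on_close=True)
```

The returned value is whatever the upload function returns, which for upload_string() is a DXFile handler.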

dxpy.bindings.dxfile_functions.list_subfolders(project, path, recurse=True)
Parameters:
  • project (string) – Project ID to use as context for the listing

  • path (string) – Subtree root path

  • recurse (boolean) – Return a complete subfolders tree

Returns a list of the subfolders under the remote path of the project. The path itself is included in the result.

Example:

list_subfolders("project-xxxx", path="/input")

dxpy.bindings.dxfile_functions.download_folder(project, destdir, folder='/', overwrite=False, chunksize=16777216, show_progress=False, **kwargs)
Parameters:
  • project (string) – Project ID to use as context for this download.

  • destdir (string) – Local destination location

  • folder (string) – Path to the remote folder to download

  • overwrite (boolean) – Overwrite existing files

Downloads the contents of the remote folder of the project into the local directory specified by destdir.

Example:

download_folder("project-xxxx", "/home/jsmith/input", folder="/input")

2.5.2. DXFile Handler

This remote file handler is a Python file-like object.
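Because the handler is file-like, generic code written against read() should work with it. The sketch below reads a file in fixed-size chunks. It assumes that read() returns an empty value at end of file, as standard Python file objects do. The helper name iter_chunks is illustrative, and io.BytesIO stands in for a DXFile opened for reading.

```python
import io


def iter_chunks(fd, chunk_size=1024 * 1024):
    """Yield successive chunks from a file-like object until it is exhausted."""
    while True:
        chunk = fd.read(chunk_size)
        if not chunk:  # an empty result signals end of file
            break
        yield chunk


# io.BytesIO stands in for a DXFile opened for reading
demo = io.BytesIO(b"abcdefghij")
print([len(c) for c in iter_chunks(demo, chunk_size=4)])  # prints [4, 4, 2]
```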

class dxpy.bindings.dxfile.DXFile(dxid=None, project=None, mode=None, read_buffer_size=16777216, write_buffer_size=16777216, expected_file_size=None, file_is_mmapd=False)

Bases: DXDataObject

Remote file object handler.

Parameters:
  • dxid (string) – Object ID

  • project (string) – Project ID

  • mode (string) – One of “r”, “w”, or “a” for read, write, and append modes, respectively. Use “b” for binary mode. For example, “rb” means open a file for reading in binary mode.

Note

The attribute values below are current as of the last time describe() was run. (Access to any of the below attributes causes describe() to be called if it has never been called before.)

media

String containing the Internet Media Type (also known as MIME type or Content-type) of the file.

_new(dx_hash, media_type=None, **kwargs)
Parameters:
  • dx_hash (dict) – Standard hash populated in dxpy.bindings.DXDataObject.new() containing attributes common to all data object classes.

  • media_type (string) – Internet Media Type

Creates a new remote file with media type media_type, if given.

Parameters:
  • dxid (string) – Object ID

  • project (string) – Project ID

  • mode (string) – One of “r”, “w”, or “a” for read, write, and append modes, respectively. Add “b” for binary mode.

  • read_buffer_size (int) – size of read buffer in bytes

  • write_buffer_size (int) – hint for size of write buffer in bytes. A lower or higher value may be used depending on region-specific parameters and on the expected file size.

  • expected_file_size (int) – size of data that will be written, if known

  • file_is_mmapd (bool) – True if input file is mmap’d (if so, the write buffer size will be constrained to be a multiple of the allocation granularity)

NO_PROJECT_HINT = 'NO_PROJECT_HINT'

classmethod set_http_threadpool_size(num_threads)

Deprecated since version 0.191.0.

next(iterator[, default])

Return the next item from the iterator. If default is given and the iterator is exhausted, it is returned instead of raising StopIteration.

set_ids(dxid, project=None)
Parameters:
  • dxid (string) – Object ID

  • project (string) – Project ID

Discards the currently stored ID and associates the handler with dxid. As a side effect, it also flushes the buffer for the previous file object if the buffer is nonempty.

seek(offset, from_what=0)
Parameters:

offset (integer) – Position in the file to seek to

Seeks to the position offset bytes from a reference point, which defaults to the beginning of the file. This is a no-op if the file is open for writing.

The position is computed from adding offset to a reference point; the reference point is selected by the from_what argument. A from_what value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point. from_what can be omitted and defaults to 0, using the beginning of the file as the reference point.
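The reference-point arithmetic can be written out as a small pure function. This sketch models only the calculation described above. It is not dxpy's implementation, and the name resolve_seek and its current_pos and file_length arguments are illustrative. Any bounds handling that seek() may perform is left out.

```python
def resolve_seek(offset, from_what, current_pos, file_length):
    """Return the absolute position implied by seek(offset, from_what)."""
    if from_what == 0:      # measure from the beginning of the file
        base = 0
    elif from_what == 1:    # measure from the current file position
        base = current_pos
    elif from_what == 2:    # measure from the end of the file
        base = file_length
    else:
        raise ValueError("from_what must be 0, 1, or 2")
    return base + offset


# A 100-byte file with the read cursor at byte 50
print(resolve_seek(10, 0, 50, 100))   # prints 10
print(resolve_seek(10, 1, 50, 100))   # prints 60
print(resolve_seek(-10, 2, 50, 100))  # prints 90
```

The values 0, 1, and 2 match the standard os.SEEK_SET, os.SEEK_CUR, and os.SEEK_END constants.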

tell()

Returns the current position of the file read cursor.

Warning: Because of buffering semantics, this value will not be accurate when using the line iterator form (for line in file).

flush(multithread=True, **kwargs)

Flushes the internal write buffer.

write(data, multithread=True, **kwargs)
Parameters:
  • data (str or mmap object) – Data to be written

  • multithread (boolean) – If True, sends multiple write requests asynchronously

Writes data to the file.

Note

Writing to remote files is append-only. Using seek() does not affect where the next write() will occur.

closed(**kwargs)
Returns:

Whether the remote file is closed

Return type:

boolean

Returns True if the remote file is closed and False otherwise. Note that if it is not closed, it can be in either the “open” or “closing” states.

close(block=False, **kwargs)
Parameters:

block (boolean) – If True, this function blocks until the remote file has closed.

Attempts to close the file.

Note

The remote file cannot be closed until all parts have been fully uploaded. An exception will be thrown if this is not the case.

wait_on_close(timeout=604800, **kwargs)
Parameters:

timeout (integer) – Maximum amount of time to wait (in seconds) until the file is closed.

Raises:

dxpy.exceptions.DXFileError if the timeout is reached before the remote file has been closed

Waits until the remote file is closed.

upload_part(data, index=None, display_progress=False, report_progress_fn=None, **kwargs)
Parameters:
  • data (str or mmap object, bytes on python3) – Data to be uploaded in this part

  • index (integer) – Index of part to be uploaded; must be in [1, 10000]

  • display_progress (boolean) – Whether to print “.” to stderr when done

  • report_progress_fn (function or None) – Optional: a function to call that takes in two arguments (self, # bytes transmitted)

Raises:

dxpy.exceptions.DXFileError if index is given and is not in the correct range, urllib3.exceptions.HTTPError if upload fails

Uploads the data in data as part number index for the associated file. If no value for index is given, index defaults to 1. This probably only makes sense if this is the only part to be uploaded.
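When a file is uploaded in several parts, each part needs its own index in [1, 10000]. The following sketch shows one way to split a byte string into indexed parts before calling upload_part(). The helper name split_into_parts and the part_size argument are hypothetical. Any platform limits on part size are not covered here and would need to be checked separately.

```python
def split_into_parts(data, part_size, max_parts=10000):
    """Return (index, chunk) pairs with 1-based indices for use with upload_part()."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    # Ceiling division; empty data still yields a single (empty) part
    num_parts = max(1, -(-len(data) // part_size))
    if num_parts > max_parts:
        raise ValueError(
            "data needs %d parts; at most %d are allowed" % (num_parts, max_parts)
        )
    return [
        (i + 1, data[i * part_size:(i + 1) * part_size])
        for i in range(num_parts)
    ]


# Intended use with an open DXFile handle fd (not executed here):
#     for index, chunk in split_into_parts(payload, part_size):
#         fd.upload_part(chunk, index=index)
#     fd.close()
print(split_into_parts(b"abcdefghij", 4))  # prints [(1, b'abcd'), (2, b'efgh'), (3, b'ij')]
```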

wait_until_parts_uploaded(**kwargs)

get_download_url(duration=None, preauthenticated=False, filename=None, project=None, **kwargs)
Parameters:
  • duration (int) – number of seconds for which the generated URL will be valid; should only be specified when preauthenticated is True

  • preauthenticated (bool) – if True, generates a ‘preauthenticated’ download URL, which embeds authentication info in the URL and does not require additional headers

  • filename (str) – desired filename of the downloaded file

  • project (str) – ID of a project containing the file (the download URL will be associated with this project, and this may affect which billing account is billed for this download). If no project is specified, an attempt will be made to verify if the file is in the project from the DXFile handler (as specified by the user or the current project stored in dxpy.WORKSPACE_ID). Otherwise, no hint is supplied. This fallback behavior does not happen inside a job environment. A non-preauthenticated URL is only valid as long as the user has access to that project and the project contains that file.

Returns:

download URL and dict containing HTTP headers to be supplied with the request

Return type:

tuple (str, dict)

Raises:
  • ResourceNotFound if a project context was given and the file was not found in that project context.

  • ResourceNotFound if no project context was given and the file was not found in any projects.

Obtains a URL that can be used to directly download the associated file.
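The returned (URL, headers) pair can be passed to any HTTP client. As a minimal sketch using only the standard library, the helper below wraps the pair in a urllib.request.Request. The helper name build_download_request is illustrative, and the URL and header values are placeholders standing in for a real get_download_url() result.

```python
import urllib.request


def build_download_request(url, headers):
    """Wrap the (url, headers) pair returned by get_download_url() in a Request."""
    return urllib.request.Request(url, headers=headers)


# Placeholder values standing in for a real get_download_url() result
req = build_download_request(
    "https://example.com/file", {"Authorization": "Bearer TOKEN"}
)
print(req.full_url)  # prints https://example.com/file
```

With a real handle, the request could then be opened with urllib.request.urlopen(req) and the response body read from the result.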

read(length=None, use_compression=None, project=None, **kwargs)

archive(all_copies=False)
Parameters:

all_copies (boolean) – Force the transition of files into the archived state. Requesting user must be the ADMIN of the project billTo org. If true, archive all the copies of files in projects with the same billTo org.

Raises:
  • InvalidState if the file is not in a live state

  • PermissionDenied if the requesting user does not have CONTRIBUTE access, or is not an ADMIN of the project billTo org when all_copies=True

unarchive(dry_run=False)
Parameters:

dry_run (boolean) – If true, only display the output of the API call without executing the unarchival

Raises:
  • InvalidState if the file is not in a closed or archived state

  • PermissionDenied if the requesting user does not have CONTRIBUTE access