2.5. Files
Files can be used to store an immutable opaque sequence of bytes.
You can obtain a handle to a new or existing file object with new_dxfile() or open_dxfile(), respectively. Both return a remote file handler, which is a Python file-like object. There are also helper functions (download_dxfile(), upload_local_file(), and upload_string()) for directly downloading and uploading existing files or strings in a single operation.
Files are tristate objects:
- When initially created, a file is in the “open” state and data can be written to it. After you have written your data to the file, call the close() method.
- The file enters the “closing” state while it is finalized in the platform.
- Some time later, the file enters the “closed” state and can be read.
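The three states can be pictured with a small local model. This is only an illustrative sketch: ToyRemoteFile, FileState, and finish_closing() are hypothetical names invented here and are not part of dxpy. The real transition from “closing” to “closed” happens on the platform, not in your process.

```python
from enum import Enum


class FileState(Enum):
    OPEN = "open"
    CLOSING = "closing"
    CLOSED = "closed"


class ToyRemoteFile:
    """Local stand-in that mimics the three states; it is not part of dxpy."""

    def __init__(self):
        self.state = FileState.OPEN
        self._chunks = []

    def write(self, data):
        # Writes are only accepted while the file is open.
        if self.state is not FileState.OPEN:
            raise ValueError("cannot write in state %s" % self.state.value)
        self._chunks.append(data)

    def close(self):
        # close() only starts finalization; the file is not readable yet.
        if self.state is FileState.OPEN:
            self.state = FileState.CLOSING

    def finish_closing(self):
        # Hypothetical hook standing in for the platform finishing its work.
        if self.state is FileState.CLOSING:
            self.state = FileState.CLOSED

    def read(self):
        # Reads are only possible once the file is fully closed.
        if self.state is not FileState.CLOSED:
            raise ValueError("cannot read in state %s" % self.state.value)
        return b"".join(self._chunks)
```

The point of the model is that close() returning does not mean the data is readable; a separate wait is needed before reading.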
Many methods that return a DXFile object take a mode parameter. In general, the available modes are as follows (some methods create a new file and consequently do not support immediate reading from it with the “r” mode):
“r” (read): for read-only access. No data is expected to be written to the file.
“w” (write): for writing. When the object exits scope, all buffers are flushed and closing commences.
“a” (append): for writing. When the object exits scope, all buffers are flushed but the file is left open.
Note
The automatic flush and close operations implied by the “w” or “a” modes only happen if the DXFile object is used in a Python context-managed scope (see the following examples).
Here is an example of writing to a file object via a context-managed file handle:
# Open a file for writing
with open_dxfile('file-xxxx', mode='w') as fd:
    for line in input_file:
        fd.write(line)
The use of the context-managed file is optional for read-only objects; that is, you may use the object without a “with” block (and omit the mode parameter), for example:
# Open a file for reading
fd = open_dxfile('file-xxxx')
for line in fd:
    print(line)
Warning
If you write any data to a file and you choose to use a non context-managed file handle, you must call flush() or close() when you are done, for example:
# Open a file for writing; we will flush it explicitly ourselves
fd = open_dxfile('file-xxxx')
for line in input_file:
    fd.write(line)
fd.flush()
If you do not do so, and there is still unflushed data when the DXFile object is garbage collected, the DXFile will attempt to flush it then, in the destructor. However, any errors in the resulting API calls (or, in general, any exception in a destructor) are not propagated back to your program! That is, your writes can silently fail if you rely on the destructor to flush your data. DXFile will print a warning if it detects unflushed data as the destructor is running (but again, it will attempt to flush it anyway).
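One way to keep flush errors visible is to wrap the write loop and the explicit flush() in a single helper. The following is a minimal sketch; write_and_flush is a hypothetical name, and fd is assumed to be any handler exposing write() and flush(), such as a DXFile.

```python
def write_and_flush(fd, lines):
    """Write every line to fd, then flush explicitly so errors surface here."""
    for line in lines:
        fd.write(line)
    # Calling flush() ourselves means a failed upload raises in this frame,
    # instead of being swallowed later in the destructor.
    fd.flush()


# Intended use (requires dxpy and a valid file ID):
#   fd = dxpy.open_dxfile('file-xxxx')
#   write_and_flush(fd, input_file)
```

Because flush() is called in ordinary code rather than in a destructor, any exception it raises reaches the caller and can be handled or logged.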
Note
Writing to a file with the “w” mode calls close() but does not wait for the file to finish closing. If the file you are writing is one of the outputs of your app or applet, you can use job-based object references, which will make downstream jobs wait for closing to finish before they can begin. However, if you intend to subsequently read from the file in the same process, you will need to call wait_on_close() to ensure the file is ready to be read.
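The write-then-read case described in this note might look like the sketch below. write_then_read is a hypothetical helper. The dx argument is assumed to be the imported dxpy module, passed in so the sketch can be exercised without a platform connection. get_id() and read() are assumed to be available on the handler (get_id() from DXDataObject, read() as part of the file-like interface); neither is documented in this section.

```python
def write_then_read(dx, text):
    """Write text to a new remote file and read it back in the same process."""
    # Leaving a "w"-mode block flushes buffers and starts closing the file.
    with dx.new_dxfile(mode="w") as fd:
        fd.write(text)
    # Closing has only started at this point; block until it has finished.
    fd.wait_on_close()
    # Reopen the now-closed file for reading (get_id() is an assumption).
    with dx.open_dxfile(fd.get_id(), mode="r") as reader:
        return reader.read()


# Intended use (requires dxpy and an authenticated session):
#   import dxpy
#   contents = write_then_read(dxpy, "foo\n")
```

Skipping the wait_on_close() call would risk opening the file while it is still in the “closing” state.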
2.5.1. Helper Functions
The following helper functions are useful shortcuts for interacting with File objects.
- dxpy.bindings.dxfile_functions.open_dxfile(dxid, project=None, mode=None, read_buffer_size=16777216)
- Parameters:
dxid (string) – file ID
- Return type:
DXFile
Given the object ID of an uploaded file, returns a remote file handler that is a Python file-like object.
Example:
with open_dxfile("file-xxxx") as fd:
    for line in fd:
        ...
Note that this is shorthand for:
DXFile(dxid)
- dxpy.bindings.dxfile_functions.new_dxfile(mode=None, write_buffer_size=16777216, expected_file_size=None, file_is_mmapd=False, **kwargs)
- Parameters:
mode (string) – One of “w” or “a” for write and append modes, respectively
- Return type:
DXFile
Additional optional parameters not listed: all those under dxpy.bindings.DXDataObject.new().
Creates a new remote file object that is ready to be written to; returns a DXFile object that is a writable file-like object.
Example:
with new_dxfile(media_type="application/json") as fd:
    fd.write("foo\n")
Note that this is shorthand for:
dxFile = DXFile()
dxFile.new(**kwargs)
- dxpy.bindings.dxfile_functions.download_dxfile(dxid, filename, chunksize=16777216, append=False, show_progress=False, project=None, describe_output=None, symlink_max_tries=15, **kwargs)
- Parameters:
dxid (string or DXFile) – DNAnexus file ID or DXFile (file handler) object
filename (string) – Local filename
append (boolean) – If True, appends to the local file (default is to truncate local file if it exists)
project (str or None) – project to use as context for this download (may affect which billing account is billed for this download). If None or DXFile.NO_PROJECT_HINT, no project hint is supplied to the API server.
describe_output (dict or None) – (experimental) output of the file-xxxx/describe API call, if available. It will make it possible to skip another describe API call. It should contain the default fields of the describe API call output and the “parts” field, not included in the output by default.
symlink_max_tries (int or None) – Maximum number of tries when downloading a symlink with aria2c.
Downloads the remote file referenced by dxid and saves it to filename.
Example:
download_dxfile("file-xxxx", "localfilename.fastq")
- dxpy.bindings.dxfile_functions.upload_local_file(filename=None, file=None, media_type=None, keep_open=False, wait_on_close=False, use_existing_dxfile=None, show_progress=False, write_buffer_size=None, multithread=True, **kwargs)
- Parameters:
filename (string) – Local filename
file (File-like object) – File-like object
media_type (string) – Internet Media Type
keep_open (boolean) – If False, closes the file after uploading
write_buffer_size (int) – Buffer size to use for upload
wait_on_close (boolean) – If True, waits for the file to close
use_existing_dxfile (DXFile) – Instead of creating a new file object, upload to the specified file
multithread (boolean) – If True, sends multiple write requests asynchronously
- Returns:
Remote file handler
- Return type:
DXFile
Additional optional parameters not listed: all those under dxpy.bindings.DXDataObject.new().
Exactly one of filename or file is required.
Uploads filename or reads from file into a new file object (with media type media_type if given) and returns the associated remote file handler. The “name” property of the newly created remote file is set to the basename of filename or to file.name (if it exists).
Examples:
# Upload from a path
dxpy.upload_local_file("/home/ubuntu/reads.fastq.gz")
# Upload from a file-like object
with open("reads.fastq") as fh:
    dxpy.upload_local_file(file=fh)
- dxpy.bindings.dxfile_functions.upload_string(to_upload, media_type=None, keep_open=False, wait_on_close=False, **kwargs)
- Parameters:
to_upload (string) – String to upload into a file
media_type (string) – Internet Media Type
keep_open (boolean) – If False, closes the file after uploading
wait_on_close (boolean) – If True, waits for the file to close
- Returns:
Remote file handler
- Return type:
DXFile
Additional optional parameters not listed: all those under dxpy.bindings.DXDataObject.new().
Uploads the data in the string to_upload into a new file object (with media type media_type if given) and returns the associated remote file handler.
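A possible round trip using upload_string() and download_dxfile() is sketched below. upload_and_fetch is a hypothetical helper, and dx is assumed to be the imported dxpy module (passed in as a parameter so the sketch can be exercised without a platform connection). Only parameters listed in the signatures above are used.

```python
def upload_and_fetch(dx, text, local_path):
    """Upload text as a new remote file, then download it to local_path."""
    # wait_on_close=True blocks until the new file has finished closing,
    # so it is ready to be read by the download that follows.
    handler = dx.upload_string(text, media_type="text/plain", wait_on_close=True)
    # download_dxfile accepts either a file ID or a DXFile handler.
    dx.download_dxfile(handler, local_path)
    with open(local_path) as fh:
        return fh.read()


# Intended use (requires dxpy and an authenticated session):
#   import dxpy
#   upload_and_fetch(dxpy, "foo\n", "foo.txt")
```

Without wait_on_close=True, the download could be attempted while the file is still closing.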
- dxpy.bindings.dxfile_functions.list_subfolders(project, path, recurse=True)
- Parameters:
project (string) – Project ID to use as context for the listing
path (string) – Subtree root path
recurse (boolean) – Return a complete subfolders tree
Returns a list of the subfolders under the remote path of the project. The path itself is included in the result.
Example:
list_subfolders("project-xxxx", path="/input")
- dxpy.bindings.dxfile_functions.download_folder(project, destdir, folder='/', overwrite=False, chunksize=16777216, show_progress=False, **kwargs)
- Parameters:
project (string) – Project ID to use as context for this download.
destdir (string) – Local destination location
folder (string) – Path to the remote folder to download
overwrite (boolean) – Overwrite existing files
Downloads the contents of the remote folder of the project into the local directory specified by destdir.
Example:
download_folder("project-xxxx", "/home/jsmith/input", folder="/input")
2.5.2. DXFile Handler
This remote file handler is a Python file-like object.
- class dxpy.bindings.dxfile.DXFile(dxid=None, project=None, mode=None, read_buffer_size=16777216, write_buffer_size=16777216, expected_file_size=None, file_is_mmapd=False)
Bases:
DXDataObject
Remote file object handler.
- Parameters:
dxid (string) – Object ID
project (string) – Project ID
mode (string) – One of “r”, “w”, or “a” for read, write, and append modes, respectively. Use “b” for binary mode. For example, “rb” means open a file for reading in binary mode.
Note
The attribute values below are current as of the last time describe() was run. (Access to any of the below attributes causes describe() to be called if it has never been called before.)
- media
String containing the Internet Media Type (also known as MIME type or Content-type) of the file.
- _new(dx_hash, media_type=None, **kwargs)
- Parameters:
dx_hash (dict) – Standard hash populated in dxpy.bindings.DXDataObject.new() containing attributes common to all data object classes.
media_type (string) – Internet Media Type
Creates a new remote file with media type media_type, if given.
- Parameters:
dxid (string) – Object ID
project (string) – Project ID
mode (string) – One of “r”, “w”, or “a” for read, write, and append modes, respectively. Add “b” for binary mode.
read_buffer_size (int) – size of read buffer in bytes
write_buffer_size (int) – hint for size of write buffer in bytes. A lower or higher value may be used depending on region-specific parameters and on the expected file size.
expected_file_size (int) – size of data that will be written, if known
file_is_mmapd (bool) – True if input file is mmap’d (if so, the write buffer size will be constrained to be a multiple of the allocation granularity)
- NO_PROJECT_HINT = 'NO_PROJECT_HINT'
- next(iterator[, default])
Return the next item from the iterator. If default is given and the iterator is exhausted, it is returned instead of raising StopIteration.
- set_ids(dxid, project=None)
- Parameters:
dxid (string) – Object ID
project (string) – Project ID
Discards the currently stored ID and associates the handler with dxid. As a side effect, it also flushes the buffer for the previous file object if the buffer is nonempty.
- seek(offset, from_what=0)
- Parameters:
offset (integer) – Position in the file to seek to
Seeks to offset bytes from the beginning of the file. This is a no-op if the file is open for writing.
The position is computed from adding offset to a reference point; the reference point is selected by the from_what argument. A from_what value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point. from_what can be omitted and defaults to 0, using the beginning of the file as the reference point.
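The three from_what values can be tried out locally. The sketch below uses io.BytesIO from the standard library as a stand-in, on the assumption that DXFile follows the same 0/1/2 convention described above; it does not touch the platform.

```python
import io

buf = io.BytesIO(b"0123456789")

buf.seek(3)                 # from_what=0 (default): 3 bytes from the start
first = buf.read(2)         # b"34"; the cursor is now at position 5

buf.seek(2, 1)              # from_what=1: 2 bytes past the current position
pos_relative = buf.tell()   # 7

buf.seek(-4, 2)             # from_what=2: 4 bytes before the end of the data
tail = buf.read()           # b"6789"
```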
- tell()
Returns the current position of the file read cursor.
Warning: Because of buffering semantics, this value will not be accurate when using the line iterator form (for line in file).
- write(data, multithread=True, **kwargs)
- Parameters:
data (str or mmap object) – Data to be written
multithread (boolean) – If True, sends multiple write requests asynchronously
Writes data to the file.
- closed(**kwargs)
- Returns:
Whether the remote file is closed
- Return type:
boolean
Returns True if the remote file is closed and False otherwise. Note that if it is not closed, it can be in either the “open” or “closing” states.
- close(block=False, **kwargs)
- Parameters:
block (boolean) – If True, this function blocks until the remote file has closed.
Attempts to close the file.
Note
The remote file cannot be closed until all parts have been fully uploaded. An exception will be thrown if this is not the case.
- wait_on_close(timeout=604800, **kwargs)
- Parameters:
timeout (integer) – Maximum amount of time to wait (in seconds) until the file is closed.
- Raises:
dxpy.exceptions.DXFileError if the timeout is reached before the remote file has been closed
Waits until the remote file is closed.
- upload_part(data, index=None, display_progress=False, report_progress_fn=None, **kwargs)
- Parameters:
data (str or mmap object, bytes on python3) – Data to be uploaded in this part
index (integer) – Index of part to be uploaded; must be in [1, 10000]
display_progress (boolean) – Whether to print “.” to stderr when done
report_progress_fn (function or None) – Optional: a function to call that takes in two arguments (self, # bytes transmitted)
- Raises:
dxpy.exceptions.DXFileError if index is given and is not in the correct range, urllib3.exceptions.HTTPError if upload fails
Uploads the data in data as part number index for the associated file. If no value for index is given, index defaults to 1. This probably only makes sense if this is the only part to be uploaded.
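When more than one part is needed, the caller has to number the parts itself. The sketch below shows one way to split a byte string into consecutive parts and pass each to upload_part() with an explicit 1-based index. upload_in_parts and MAX_PARTS are hypothetical names, and fd is assumed to be an open DXFile. The platform may impose further constraints on part sizes that are not covered in this section, so treat the chunking policy as an assumption.

```python
MAX_PARTS = 10000  # upper bound on the part index stated above


def upload_in_parts(fd, data, part_size):
    """Upload data as consecutive parts numbered from 1; return the part count."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    # Ceiling division; an empty payload is still sent as a single part.
    n_parts = max(1, -(-len(data) // part_size))
    if n_parts > MAX_PARTS:
        raise ValueError(
            "data would need %d parts; the limit is %d" % (n_parts, MAX_PARTS)
        )
    for i in range(n_parts):
        chunk = data[i * part_size:(i + 1) * part_size]
        fd.upload_part(chunk, index=i + 1)  # part indices are 1-based
    return n_parts
```

After all parts have been uploaded, the file still needs to be closed with close() before it can be read.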
- get_download_url(duration=None, preauthenticated=False, filename=None, project=None, **kwargs)
- Parameters:
duration (int) – number of seconds for which the generated URL will be valid, should only be specified when preauthenticated is True
preauthenticated (bool) – if True, generates a ‘preauthenticated’ download URL, which embeds authentication info in the URL and does not require additional headers
filename (str) – desired filename of the downloaded file
project (str) – ID of a project containing the file (the download URL will be associated with this project, and this may affect which billing account is billed for this download). If no project is specified, an attempt will be made to verify if the file is in the project from the DXFile handler (as specified by the user or the current project stored in dxpy.WORKSPACE_ID). Otherwise, no hint is supplied. This fall back behavior does not happen inside a job environment. A non preauthenticated URL is only valid as long as the user has access to that project and the project contains that file.
- Returns:
download URL and dict containing HTTP headers to be supplied with the request
- Return type:
tuple (str, dict)
- Raises:
ResourceNotFound if a project context was given and the file was not found in that project context.
- Raises:
ResourceNotFound if no project context was given and the file was not found in any projects.
Obtains a URL that can be used to directly download the associated file.
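Since get_download_url() returns both a URL and a dict of HTTP headers, a plain HTTP client has to send the two together. The following is a minimal sketch using urllib.request from the standard library. fetch_file_bytes is a hypothetical helper, and the opener parameter exists only so the function can be exercised without network access.

```python
import urllib.request


def fetch_file_bytes(fd, opener=urllib.request.urlopen):
    """Download the file behind fd using the URL and headers it returns."""
    url, headers = fd.get_download_url()
    # The returned headers must accompany the request unless the URL
    # was generated with preauthenticated=True.
    request = urllib.request.Request(url, headers=headers)
    with opener(request) as response:
        return response.read()
```

This reads the whole response into memory, so it is only suitable for small files; for larger ones download_dxfile() is the more natural choice.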
- archive(all_copies=False)
- Parameters:
all_copies (boolean) – Force the transition of files into the archived state. Requesting user must be the ADMIN of the project billTo org. If true, archive all the copies of files in projects with the same billTo org.
- Raises:
InvalidState if the file is not in a live state
- Raises:
PermissionDenied if the requesting user does not have CONTRIBUTE access or is not an ADMIN of the project billTo org with allCopies=True.
- unarchive(dry_run=False)
- Parameters:
dry_run (boolean) – If true, only display the output of the API call without executing the unarchival
- Raises:
InvalidState if the file is not in a closed or archived state
- Raises:
PermissionDenied if the requesting user does not have CONTRIBUTE access