File parsers¶
File parsers.
XML based file formats¶
Specs File Format¶
This file is used to parse XPS and ISS data from XML files from the SPECS program.
In this file format the spectra (called regions) are contained in region groups inside the files. This structure is mirrored in the data structure below, where classes are provided for the 3 top level objects:
Files -> Region Groups -> Regions
The parser is strict, in the sense that it will throw an exception if it encounters anything it does not understand. To change this behavior set the EXCEPTION_ON_UNHANDLED module variable to False.
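The strict versus lenient behavior can be pictured with a minimal sketch. Only the name EXCEPTION_ON_UNHANDLED comes from the module; the HANDLERS table and the convert_tag function are hypothetical stand-ins and not the parser's actual code.

```python
# Hypothetical stand-in for a strict parser; only the flag name is from the module
EXCEPTION_ON_UNHANDLED = True

# Hypothetical table of the element types this toy parser understands
HANDLERS = {'string': str, 'double': float}


def convert_tag(tag, text, strict=None):
    """Convert text according to tag, or react to an unknown tag"""
    if strict is None:
        strict = EXCEPTION_ON_UNHANDLED
    if tag in HANDLERS:
        return HANDLERS[tag](text)
    if strict:
        # Strict mode: refuse to continue on anything not understood
        raise ValueError('Unhandled element type: {}'.format(tag))
    return None  # Lenient mode: silently skip what is not understood
```

With the real module, the equivalent of switching to lenient mode is to import the specs module and set its EXCEPTION_ON_UNHANDLED variable to False before parsing a file.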
Usage examples¶
To use the file parser, simply feed the top level data structure a path to a data file and start to use it:
from PyExpLabSys.file_parsers.specs import SpecsFile
import matplotlib.pyplot as plt
file_ = SpecsFile('path_to_my_xps_file.xml')
# Access the regions groups by iteration
for region_group in file_:
    print('{} regions in region group: {}'.format(
        len(region_group), region_group.name))
# or by index
region_group = file_[0]
# And again access regions by iteration
for region in region_group:
    print('region: {}'.format(region.name))
# or by index
region = region_group[0]
# or you can search for them from the file level
region = list(file_.search_regions('Mo'))[0]
print(region)
# NOTE: according to the module documentation, search_regions returns a list
# and search_regions_iter returns a generator, so the conversion to list is
# only required for the latter
# From the regions, the x data can be accessed either as kinetic
# or binding energy (for XPS only) and the y data can be accessed
# as averages of the counts, either as pure count numbers or as
# counts per second. These options work independently of each
# other.
# counts as function of kinetic energy
plt.plot(region.x, region.y_avg_counts)
plt.show()
# cps as function of binding energy
plt.plot(region.x_be, region.y_avg_cps)
plt.show()
# Files also have a useful str representation that shows the hierarchy
print(file_)
Notes
The file format seems to basically be a dump of a large low level data structure from the implementation language. With an appropriate mapping of low level data structure types to python types (see details below and in the simple_convert function), this data structure could have been mapped in its entirety to python types. However, in order to provide a clearer data structure, a more object oriented approach has been taken, where the topmost level data structures are implemented as classes. Inside these classes, the data is parsed into numpy arrays and the remaining low level data structures are parsed into python data structures with the simple_convert function.
Module Documentation¶
-
PyExpLabSys.file_parsers.specs.
simple_convert
(element)[source]¶ Converts an XML data structure to pure python types.
Parameters: element (xml.etree.ElementTree.Element) – The XML element to convert
Returns: A hierarchy of python data structures
Return type: object
Simple element types are converted as follows:
XML type | Python type
string | str
ulong | long
double | float
boolean | bool
struct | dict
sequence | list
Arrays are converted to numpy arrays, wherein the type conversion is:
XML type | Python type
ulong | numpy.uint64
double | numpy.double
Besides these types there are a few special elements that have a custom conversion.
- Enum is simply converted into its value, since enums are considered to be a program implementation detail whose information is not relevant for a data file parser
- Any is skipped and replaced with its content
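The scalar and container part of this mapping can be sketched with the standard library alone. This is an illustration and not the module's simple_convert: the 'name' attribute on struct members, the literal 'true' for booleans and the sample XML are all assumptions, arrays and the special elements are left out, and ulong maps to int because Python 3 has no separate long type.

```python
import xml.etree.ElementTree as ET

# Scalar conversions from the table (Python 3: ulong becomes int)
SCALARS = {
    'string': lambda text: text or '',
    'ulong': int,
    'double': float,
    # ASSUMPTION: booleans are written as the literal text 'true' or 'false'
    'boolean': lambda text: text == 'true',
}


def convert_sketch(element):
    """Recursively convert an element to plain Python types"""
    if element.tag in SCALARS:
        return SCALARS[element.tag](element.text)
    if element.tag == 'struct':
        # ASSUMPTION: struct members are identified by a 'name' attribute
        return {child.attrib['name']: convert_sketch(child) for child in element}
    if element.tag == 'sequence':
        return [convert_sketch(child) for child in element]
    raise ValueError('Unhandled element type: {}'.format(element.tag))


# Hypothetical sample input, not taken from a real SPECS file
XML = """
<struct name="region">
  <string name="name">Mo3d</string>
  <ulong name="num_scans">4</ulong>
  <double name="dwell_time">0.1</double>
  <boolean name="active">true</boolean>
  <sequence name="shifts"><double>0.5</double><double>1.5</double></sequence>
</struct>
"""
converted = convert_sketch(ET.fromstring(XML))
print(converted)
```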
-
class
PyExpLabSys.file_parsers.specs.
SpecsFile
(filepath, encoding=None)[source]¶ Bases:
list
This is the top structure for a parsed file which represents a list of RegionGroups
The class contains a ‘filepath’ attribute.
-
regions_iter
¶ Returns an iterator over the regions
-
search_regions_iter
(search_term)[source]¶ Returns a generator of search results for regions by name
Parameters: search_term (str) – The term to search for (case sensitive)
Returns: An iterator of matching regions
Return type: generator
-
search_regions
(search_term)[source]¶ Returns a list of search results for regions by name
Parameters: search_term (str) – The term to search for (case sensitive)
Returns: A list of matching regions
Return type: list
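The relation between the generator and the list variant can be sketched as follows. The FakeRegion class and the free functions are hypothetical stand-ins, and the substring test is an assumption about what "search by name" means.

```python
class FakeRegion(object):
    """Hypothetical stand-in for a Region; only the name is used here"""

    def __init__(self, name):
        self.name = name


def search_regions_iter(region_groups, search_term):
    """Lazily yield regions whose name contains search_term"""
    for group in region_groups:
        for region in group:
            # ASSUMPTION: matching is a case sensitive substring test
            if search_term in region.name:
                yield region


def search_regions(region_groups, search_term):
    """Eager variant: collect the generator into a list"""
    return list(search_regions_iter(region_groups, search_term))


groups = [
    [FakeRegion('Mo3d'), FakeRegion('O1s')],
    [FakeRegion('Mo3p'), FakeRegion('mo')],
]
matches = search_regions(groups, 'Mo')
print([region.name for region in matches])
```

The generator form avoids building the full result when only the first match is needed, e.g. with next(search_regions_iter(groups, 'Mo')).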
-
unix_timestamp
¶ Returns the unix timestamp of the first region
-
get_analysis_method
()[source]¶ Returns the analysis method of the file
Raises: ValueError – If more than one analysis method is used
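A minimal sketch of the documented contract, assuming the analysis methods of all regions have already been collected into a list of strings. The function here is a hypothetical free-standing version, not the method itself.

```python
def get_analysis_method(methods):
    """Return the single analysis method found in methods

    Raises ValueError if more than one distinct method is present.
    NOTE: behavior for an empty list is not covered by the documentation;
    in this sketch set.pop() would raise a KeyError.
    """
    unique = set(methods)
    if len(unique) > 1:
        raise ValueError(
            'More than one analysis method: {}'.format(sorted(unique)))
    return unique.pop()
```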
-
-
class
PyExpLabSys.file_parsers.specs.
RegionGroup
(xml)[source]¶ Bases:
list
Class that represents a region group, which consists of a list of regions
The class contains a ‘name’ and a ‘parameters’ attribute.
-
class
PyExpLabSys.file_parsers.specs.
Region
(xml)[source]¶ Bases:
object
Class that represents a region
The class contains attributes for the items listed in the ‘information_names’ class variable.
- Some useful ones are:
- name: The name of the region
- region: Contains information like dwell_time, analysis_method, scan_delta, excitation_energy etc.
All auxiliary information is also available from the ‘info’ attribute.
-
__init__
(xml)[source]¶ Parse the XML and initialize internal variables
Parameters: xml (xml.etree.ElementTree.Element) – The region XML element
-
x
¶ Returns the kinetic energy x-values as a Numpy array
-
x_be
¶ Returns the binding energy x-values as a Numpy array
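For XPS, the binding energy axis follows from the kinetic energy axis and the excitation energy of the source. A minimal sketch, assuming the simple relation E_binding = E_excitation - E_kinetic with any work function correction ignored; whether the parser applies further corrections is not stated here. The function name and sample values are illustrative.

```python
def kinetic_to_binding(kinetic_energies, excitation_energy):
    """Convert kinetic energies (eV) to binding energies (eV)

    ASSUMPTION: E_binding = E_excitation - E_kinetic, no work function term
    """
    return [excitation_energy - e_kin for e_kin in kinetic_energies]


# Illustrative kinetic energy axis and an Al K-alpha source (1486.6 eV)
x = [1000.0, 1000.5, 1001.0]
x_be = kinetic_to_binding(x, 1486.6)
print(x_be)
```

Note that the binding energy axis runs in the opposite direction of the kinetic energy axis.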
-
iter_cycles
¶ Returns a generator of cycles
Each cycle is in itself a generator of lists of scans. To iterate over single scans do:
for cycle in self.iter_cycles:
    for scans in cycle:
        for scan in scans:
            print(scan)
or use iter_scans, which does just that.
-
iter_scans
¶ Returns a generator of single scans, which in themselves are Numpy arrays
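Flattening the cycles -> lists of scans -> scans nesting into a stream of single scans can be sketched with itertools. The function takes the cycles as an argument and uses plain lists as stand-ins for the Numpy arrays; both are simplifications for illustration.

```python
from itertools import chain


def iter_scans(cycles):
    """Yield single scans from an iterable of cycles

    Each cycle is an iterable of lists of scans, so one level of
    flattening per cycle is enough.
    """
    for cycle in cycles:
        for scan in chain.from_iterable(cycle):
            yield scan


# Two cycles; the first holds two lists of scans, the second holds one
cycles = [
    [[[1, 2], [3, 4]], [[5, 6]]],
    [[[7, 8]]],
]
flat = list(iter_scans(cycles))
print(flat)
```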
-
y_avg_counts
¶ Returns the average counts as a Numpy array
-
y_avg_cps
¶ Returns the average counts per second as a Numpy array
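The two y representations can be illustrated with a small sketch. It assumes that the average is taken point by point over all scans and that counts per second are obtained by dividing by the dwell time; the function names and numbers are made up for the example, and plain lists stand in for Numpy arrays.

```python
def average_counts(scans):
    """Point-wise average over a list of equally long scans"""
    number_of_scans = float(len(scans))
    return [sum(column) / number_of_scans for column in zip(*scans)]


def counts_per_second(avg_counts, dwell_time):
    """ASSUMPTION: cps = average counts / dwell time (in seconds)"""
    return [counts / dwell_time for counts in avg_counts]


scans = [[10, 20, 30], [14, 22, 26]]
y_avg_counts = average_counts(scans)
y_avg_cps = counts_per_second(y_avg_counts, 0.5)
print(y_avg_counts, y_avg_cps)
```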
-
unix_timestamp
¶ Returns the unix timestamp of the first cycle
Binary File Formats¶
File parser for Chemstation files
Copyright (C) 2015-2018 CINF team on GitHub: https://github.com/CINF
The General Stepped Program Runner is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
The General Stepped Program Runner is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with The CINF Data Presentation Website. If not, see <http://www.gnu.org/licenses/>.
Note
This file parser went through a large re-write on ??? which changed the data structures of the resulting objects. This means that upon upgrading it will be necessary to update code. The re-write was done to fix some serious errors from the first version, like relying on the Report.TXT file for injection summaries. These are now fetched from the more ordered CSV files.
-
exception
PyExpLabSys.file_parsers.chemstation.
NoInjections
[source]¶ Bases:
Exception
Exception raised when there are no injections in the sequence
-
class
PyExpLabSys.file_parsers.chemstation.
Sequence
(sequence_dir_path)[source]¶ Bases:
object
The Sequence class for the Chemstation data format
Parameters: -
__init__
(sequence_dir_path)[source]¶ Instantiate object properties
Parameters: sequence_dir_path (str) – The path of the sequence
-
full_sequence_dataset
(column_names=None)[source]¶ Generate peak name specific dataset
This will collect area values for named peaks as a function of time over the different injections.
Parameters: column_names (dict) – A dict of the column names needed from the report lines. The dict should hold the keys: ‘peak_name’, ‘retention_time’ and ‘area’. It defaults to: column_names = {‘peak_name’: ‘Compound Name’, ‘retention_time’: ‘Retention Timemin’, ‘area’: ‘Area’}
Returns: Mapping of signal_and_peak names and the values
Return type: dict
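The idea of collecting peak areas over the injections can be sketched as follows. Everything except the default column names is an assumption: the (time, reports) input shape, the 'signal - peak name' key format and the sample numbers are hypothetical and only illustrate the kind of mapping described.

```python
from collections import defaultdict

# Default column names as given in the documentation
COLUMN_NAMES = {
    'peak_name': 'Compound Name',
    'retention_time': 'Retention Timemin',
    'area': 'Area',
}


def dataset_sketch(injections, column_names=None):
    """Collect (time, area) points per signal and peak name

    injections is a list of (time, reports) tuples, where reports maps a
    signal name to a list of report line dicts (hypothetical input shape).
    """
    if column_names is None:
        column_names = COLUMN_NAMES
    dataset = defaultdict(list)
    for time, reports in injections:
        for signal, lines in reports.items():
            for line in lines:
                # ASSUMPTION: the key combines signal and peak name like this
                key = '{} - {}'.format(signal, line[column_names['peak_name']])
                dataset[key].append((time, line[column_names['area']]))
    return dict(dataset)


# Made-up data for two injections on one signal
injections = [
    (0.0, {'FID1 A': [{'Compound Name': 'CO2', 'Retention Timemin': 5.81, 'Area': 22.81}]}),
    (600.0, {'FID1 A': [{'Compound Name': 'CO2', 'Retention Timemin': 5.80, 'Area': 25.10}]}),
]
dataset = dataset_sketch(injections)
print(dataset)
```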
-
-
class
PyExpLabSys.file_parsers.chemstation.
Injection
(injection_dirpath, load_raw_spectra=True, read_report_txt=True)[source]¶ Bases:
object
The Injection class for the Chemstation data format
Parameters: - injection_dirpath (str) – The path of the directory of this injection
- reports (defaultdict) –
Signal -> list_of_report_lines dict. Each report line is a dict of column headers to type converted column content. E.g.:
{u'Area': 22.81, u'Area %': 0.24, u'Height': 12.66, u'Peak Number': 1, u'Peak Type': u'BB', u'Peak Widthmin': 0.027, u'Retention Timemin': 5.81}
The column headers are also stored in metadata under the columns key.
- reports_raw (defaultdict) – Same as reports except the content is not type converted.
- metadata (dict) – Dict of metadata
- raw_files (dict) – Mapping of ch_file_name -> CHFile objects
- report_txt (str or None) – The content of the Report.TXT file from the injection folder, if any
-
PyExpLabSys.file_parsers.chemstation.
parse_utf16_string
(file_, encoding='UTF16')[source]¶ Parse a pascal type UTF16 encoded string from a binary file object
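A Pascal type string stores its length in front of the character data. A minimal sketch of reading one from a binary file object, assuming a single length byte that counts 2-byte characters and little-endian UTF-16 data; the function name and these layout details are assumptions and not a description of the actual implementation.

```python
import io
import struct


def parse_pascal_utf16(file_, encoding='utf-16-le'):
    """Read a length-prefixed UTF-16 string from a binary file object

    ASSUMPTIONS: the prefix is one unsigned byte holding the number of
    2-byte characters, and the data is little-endian without a BOM.
    """
    length = struct.unpack('B', file_.read(1))[0]
    return file_.read(2 * length).decode(encoding)


# Build a small in-memory "file": length byte, string data, trailing bytes
payload = 'mAU'.encode('utf-16-le')
buffer_ = io.BytesIO(struct.pack('B', 3) + payload + b'rest')
text = parse_pascal_utf16(buffer_)
print(text)
```

After the call, the file position sits right after the string, so subsequent fields can be read sequentially.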
-
class
PyExpLabSys.file_parsers.chemstation.
CHFile
(filepath)[source]¶ Bases:
object
Class that implements the Agilent .ch file format version 179
Warning
Not all aspects of the file header are understood, so there may be, and probably is, information that is not parsed. See the method _parse_header_status() for an overview of which parts of the header are understood.
Note
Although the fundamental storage of the actual data has changed, lots of inspiration for the parsing of the header has been drawn from the parser in the ImportAgilent.m file in the chemplexity/chromatography project. All credit for the parts of the header parsing that could be reused goes to the author of that project.
-
values
¶ The intensity values (y-values) of the spectrum. The unit for the values is given in metadata[‘units’]
Type: numpy.array
-
__init__
(filepath)[source]¶ Instantiate object
Parameters: filepath (str) – The path of the data file
-
times
¶ The time values (x-value) for the data set in minutes
-