Includes all processors through release 1.2.0

I looked around at what can be done with Apache NiFi and didn’t notice a list of processors without looking at the code or building the project. I think a list of available processors, the work horse of Apache Nifi, would greatly help decide if it is right for certain needs. So, I went into the usage guide in the Apache Nifi UI and pulled a list of processors and a quick description for those who want to know what possibilities there are before getting into nifi itself!

List of processors

With new releases of Nifi, the number of processors have increased from the original 53 to 154 to what we currently have today! Here is a list of all processors, listed alphabetically, that are currently in Apache Nifi as of the most recent release. Each one links to a description of the processor further down. The Usage documentation available in the web ui has much more detail about each processor, it’s properties, modifiable attributes, and relationships and each processor has it’s own page in the UI, so here is just a quick overview. Again, this content is taken directly from Nifi’s Usage guide in their web UI and all credit/rights belong to them under the Apache 2.0 License.

Nifi has improved their documentation, which was originally only available when running apache nifi. The documentation now is produced through the build process and has been added to their website. So if you need more information or more detail about each processor just check there.

AttributeRollingWindow 1.2.0

Track a Rolling Window based on evaluating an Expression Language expression on each FlowFile and add that value to the processor’s state. Each FlowFile will be emitted with the count of FlowFiles and total aggregate value of values processed in the current time window.

AttributesToJSON 1.2.0

Generates a JSON representation of the input FlowFile Attributes. The resulting JSON can be written to either a new Attribute ‘JSONAttributes’ or written to the FlowFile as content.

Base64EncodeContent 1.2.0

Encodes or decodes content to and from base64

CaptureChangeMySQL 1.2.0

Retrieves Change Data Capture (CDC) events from a MySQL database. CDC Events include INSERT, UPDATE, DELETE operations. Events are output as individual flow files ordered by the time at which the operation occurred.

CompareFuzzyHash 1.2.0

Compares an attribute containing a Fuzzy Hash against a file containing a list of fuzzy hashes, appending an attribute to the FlowFile in case of a successful match.

CompressContent 1.2.0

Compresses or decompresses the contents of FlowFiles using a user-specified compression algorithm and updates the mime.type attribute as appropriate

ConnectWebSocket 1.2.0

Acts as a WebSocket client endpoint to interact with a remote WebSocket server. FlowFiles are transferred to downstream relationships according to received message types as WebSocket client configured with this processor receives messages from remote WebSocket server.

ConsumeAMQP 1.2.0

Consumes AMQP Message transforming its content to a FlowFile and transitioning it to ‘success’ relationship

ConsumeEWS 1.2.0

Consumes messages from Microsoft Exchange using Exchange Web Services. The raw-bytes of each received email message are written as contents of the FlowFile

ConsumeIMAP 1.2.0

Consumes messages from Email Server using IMAP protocol. The raw-bytes of each received email message are written as contents of the FlowFile

ConsumeJMS 1.2.0

Consumes JMS Message of type BytesMessage or TextMessage transforming its content to a FlowFile and transitioning it to ‘success’ relationship. JMS attributes such as headers and properties will be copied as FlowFile attributes.

ConsumeKafka 1.2.0

Consumes messages from Apache Kafka specifically built against the Kafka 0.9.x Consumer API. Please note there are cases where the publisher can get into an indefinite stuck state. We are closely monitoring how this evolves in the Kafka community and will take advantage of those fixes as soon as we can. In the mean time it is possible to enter states where the only resolution will be to restart the JVM NiFi runs on. The complementary NiFi processor for sending messages is PublishKafka.

ConsumeKafka_0_10 1.2.0

Consumes messages from Apache Kafka specifically built against the Kafka 0.10.x Consumer API. Please note there are cases where the publisher can get into an indefinite stuck state. We are closely monitoring how this evolves in the Kafka community and will take advantage of those fixes as soon as we can. In the meantime it is possible to enter states where the only resolution will be to restart the JVM NiFi runs on. The complementary NiFi processor for sending messages is PublishKafka_0_10.

ConsumeKafkaRecord_0_10 1.2.0

Consumes messages from Apache Kafka specifically built against the Kafka 0.10.x Consumer API. The complementary NiFi processor for sending messages is PublishKafka_0_10. Please note that, at this time, the Processor assumes that all records that are retrieved from a given partition have the same schema. If any of the Kafka messages are pulled but cannot be parsed or written with the configured Record Reader or Record Writer, the contents of the message will be written to a separate FlowFile, and that FlowFile will be transferred to the ‘parse.failure’ relationship. Otherwise, each FlowFile is sent to the ‘success’ relationship and may contain many individual messages within the single FlowFile. A ‘record.count’ attribute is added to indicate how many messages are contained in the FlowFile.

ConsumeMQTT 1.2.0

Subscribes to a topic and receives messages from an MQTT broker

ConsumePOP3 1.2.0

Consumes messages from Email Server using POP3 protocol. The raw-bytes of each received email message are written as contents of the FlowFile

ConsumeWindowsEventLog 1.2.0

Registers a Windows Event Log Subscribe Callback to receive FlowFiles from Events on Windows. These can be filtered via channel and XPath.

ControlRate 1.2.0

Controls the rate at which data is transferred to follow-on processors. If you configure a very small Time Duration, then the accuracy of the throttle gets worse. You can improve this accuracy by decreasing the Yield Duration, at the expense of more Tasks given to the processor.

ConvertAvroSchema 1.2.0

Convert records from one Avro schema to another, including support for flattening and simple type conversions

ConvertAvroToJSON 1.2.0

Converts a Binary Avro record into a JSON object. This processor provides a direct mapping of an Avro field to a JSON field, such that the resulting JSON will have the same hierarchical structure as the Avro document. Note that the Avro schema information will be lost, as this is not a translation from binary Avro to JSON formatted Avro. The output JSON is encoded the UTF-8 encoding. If an incoming FlowFile contains a stream of multiple Avro records, the resultant FlowFile will contain a JSON Array containing all of the Avro records or a sequence of JSON Objects. If an incoming FlowFile does not contain any records, an empty JSON object is the output. Empty/Single Avro record FlowFile inputs are optionally wrapped in a container as dictated by ‘Wrap Single Record’

ConvertAvroToORC 1.2.0

Converts an Avro record into ORC file format. This processor provides a direct mapping of an Avro record to an ORC record, such that the resulting ORC file will have the same hierarchical structure as the Avro document. If an incoming FlowFile contains a stream of multiple Avro records, the resultant FlowFile will contain a ORC file containing all of the Avro records. If an incoming FlowFile does not contain any records, an empty ORC file is the output. NOTE: Many Avro datatypes (collections, primitives, and unions of primitives, e.g.) can be converted to ORC, but unions of collections and other complex datatypes may not be able to be converted to ORC.

ConvertCharacterSet 1.2.0

Converts a FlowFile’s content from one character set to another

ConvertCSVToAvro 1.2.0

Converts CSV files to Avro according to an Avro Schema

ConvertExcelToCSVProcessor 1.2.0

Consumes a Microsoft Excel document and converts each worksheet to csv. Each sheet from the incoming Excel document will generate a new Flowfile that will be output from this processor. Each output Flowfile’s contents will be formatted as a csv file where the each row from the excel sheet is output as a newline in the csv file. This processor is currently only capable of processing .xlsx (XSSF 2007 OOXML file format) Excel documents and not older .xls (HSSF ‘97(-2007) file format) documents. This processor also expects well formatted CSV content and will not escape cell’s containing invalid content such as newlines or additional commas.

ConvertJSONToAvro 1.2.0

Converts JSON files to Avro according to an Avro Schema

ConvertJSONToSQL 1.2.0

Converts a JSON-formatted FlowFile into an UPDATE, INSERT, or DELETE SQL statement. The incoming FlowFile is expected to be “flat” JSON message, meaning that it consists of a single JSON element and each field maps to a simple type. If a field maps to a JSON object, that JSON object will be interpreted as Text. If the input is an array of JSON elements, each element in the array is output as a separate FlowFile to the ‘sql’ relationship. Upon successful conversion, the original FlowFile is routed to the ‘original’ relationship and the SQL is routed to the ‘sql’ relationship.

ConvertRecord 1.2.0

Converts records from one data format to another using configured Record Reader and Record Write Controller Services. The Reader and Writer must be configured with “matching” schemas. By this, we mean the schemas must have the same field names. The types of the fields do not have to be the same if a field value can be coerced from one type to another. For instance, if the input schema has a field named “balance” of type double, the output schema can have a field named “balance” with a type of string, double, or float. If any field is present in the input that is not present in the output, the field will be left out of the output. If any field is specified in the output schema but is not present in the input data/schema, then the field will not be present in the output or will have a null value, depending on the writer.

CreateHadoopSequenceFile 1.2.0

Creates Hadoop Sequence Files from incoming flow files

DebugFlow 1.2.0

The DebugFlow processor aids testing and debugging the FlowFile framework by allowing various responses to be explicitly triggered in response to the receipt of a FlowFile or a timer event without a FlowFile if using timer or cron based scheduling. It can force responses needed to exercise or test various failure modes that can occur when a processor runs.

DeleteDynamoDB 1.2.0

Deletes a document from DynamoDB based on hash and range key. The key can be string or number. The request requires all the primary keys for the operation (hash or hash and range key)

DeleteGCSObject 1.2.0

Deletes objects from a Google Cloud Bucket. If attempting to delete a file that does not exist, FlowFile is routed to success.

DeleteHDFS 1.2.0

Deletes one or more files or directories from HDFS. The path can be provided as an attribute from an incoming FlowFile, or a statically set path that is periodically removed. If this processor has an incoming connection, itwill ignore running on a periodic basis and instead rely on incoming FlowFiles to trigger a delete. Note that you may use a wildcard character to match multiple files or directories. If there are no incoming connections no flowfiles will be transfered to any output relationships. If there is an incoming flowfile then provided there are no detected failures it will be transferred to success otherwise it will be sent to false. If knowledge of globbed files deleted is necessary use ListHDFS first to produce a specific list of files to delete.

DeleteS3Object 1.2.0

Deletes FlowFiles on an Amazon S3 Bucket. If attempting to delete a file that does not exist, FlowFile is routed to success.

DeleteSQS 1.2.0

Deletes a message from an Amazon Simple Queuing Service Queue

DetectDuplicate 1.2.0

Caches a value, computed from FlowFile attributes, for each incoming FlowFile and determines if the cached value has already been seen. If so, routes the FlowFile to ‘duplicate’ with an attribute named ‘original.identifier’ that specifies the original FlowFile’s “description”, which is specified in the property. If the FlowFile is not determined to be a duplicate, the Processor routes the FlowFile to 'non-duplicate'

DistributeLoad 1.2.0

Distributes FlowFiles to downstream processors based on a Distribution Strategy. If using the Round Robin strategy, the default is to assign each destination a weighting of 1 (evenly distributed). However, optional propertiescan be added to the change this; adding a property with the name ‘5’ and value ‘10’ means that the relationship with name ‘5’ will be receive 10 FlowFiles in each iteration instead of 1.

DuplicateFlowFile 1.2.0

Intended for load testing, this processor will create the configured number of copies of each incoming FlowFile

EncryptContent 1.2.0

Encrypts or Decrypts a FlowFile using either symmetric encryption with a password and randomly generated salt, or asymmetric encryption using a public and secret key.

EnforceOrder 1.2.0

Enforces expected ordering of FlowFiles those belong to the same data group. Although PriorityAttributePrioritizer can be used on a connection to ensure that flow files going through that connection are in priority order, depending on error-handling, branching, and other flow designs, it is possible for FlowFiles to get out-of-order. EnforceOrder can be used to enforce original ordering for those FlowFiles. [IMPORTANT] In order to take effect of EnforceOrder, FirstInFirstOutPrioritizer should be used at EVERY downstream relationship UNTIL the order of FlowFiles physically get FIXED by operation such as MergeContent or being stored to the final destination.

EvaluateJsonPath 1.2.0

Evaluates one or more JsonPath expressions against the content of a FlowFile. The results of those expressions are assigned to FlowFile Attributes or are written to the content of the FlowFile itself, depending on configuration of the Processor. JsonPaths are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed (if the Destination is flowfile-attribute; otherwise, the property name is ignored). The value of the property must be a valid JsonPath expression. A Return Type of ‘auto-detect’ will make a determination based off the configured destination. When ‘Destination’ is set to ‘flowfile-attribute,’ a return type of ‘scalar’ will be used. When ‘Destination’ is set to ‘flowfile-content,’ a return type of ‘JSON’ will be used.If the JsonPath evaluates to a JSON array or JSON object and the Return Type is set to ‘scalar’ the FlowFile will be unmodified and will be routed to failure. A Return Type of JSON can return scalar values if the provided JsonPath evaluates to the specified value and will be routed as a match.If Destination is ‘flowfile-content’ and the JsonPath does not evaluate to a defined path, the FlowFile will be routed to ‘unmatched’ without having its contents modified. If Destination is flowfile-attribute and the expression matches nothing, attributes will be created with empty strings as the value, and the FlowFile will always be routed to ‘matched.’

EvaluateXPath 1.2.0

Evaluates one or more XPaths against the content of a FlowFile. The results of those XPaths are assigned to FlowFile Attributes or are written to the content of the FlowFile itself, depending on configuration of the Processor. XPaths are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed (if the Destination is flowfile-attribute; otherwise, the property name is ignored). The value of the property must be a valid XPath expression. If the XPath evaluates to more than one node and the Return Type is set to ‘nodeset’ (either directly, or via ‘auto-detect’ with a Destination of ‘flowfile-content’), the FlowFile will be unmodified and will be routed to failure. If the XPath does not evaluate to a Node, the FlowFile will be routed to ‘unmatched’ without having its contents modified. If Destination is flowfile-attribute and the expression matches nothing, attributes will be created with empty strings as the value, and the FlowFile will always be routed to ‘matched’

EvaluateXQuery 1.2.0

Evaluates one or more XQueries against the content of a FlowFile. The results of those XQueries are assigned to FlowFile Attributes or are written to the content of the FlowFile itself, depending on configuration of the Processor. XQueries are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed (if the Destination is ‘flowfile-attribute’; otherwise, the property name is ignored). The value of the property must be a valid XQuery. If the XQuery returns more than one result, new attributes or FlowFiles (for Destinations of ‘flowfile-attribute’ or ‘flowfile-content’ respectively) will be created for each result (attributes will have a ‘.n’ one-up number appended to the specified attribute name). If any provided XQuery returns a result, the FlowFile(s) will be routed to ‘matched’. If no provided XQuery returns a result, the FlowFile will be routed to ‘unmatched’. If the Destination is ‘flowfile-attribute’ and the XQueries matche nothing, no attributes will be applied to the FlowFile.

ExecuteFlumeSink 1.2.0

Execute a Flume sink. Each input FlowFile is converted into a Flume Event for processing by the sink.

ExecuteFlumeSource 1.2.0

Execute a Flume source. Each Flume Event is sent to the success relationship as a FlowFile

ExecuteProcess 1.2.0

Runs an operating system command specified by the user and writes the output of that command to a FlowFile. If the command is expected to be long-running, the Processor can output the partial data on a specified interval. When this option is used, the output is expected to be in textual format, as it typically does not make sense to split binary data on arbitrary time-based intervals.

ExecuteScript 1.2.0

Experimental - Executes a script given the flow file and a process session. The script is responsible for handling the incoming flow file (transfer to SUCCESS or remove, e.g.) as well as any flow files created by the script. If the handling is incomplete or incorrect, the session will be rolled back. Experimental: Impact of sustained usage not yet verified.

ExecuteSQL 1.2.0

Execute provided SQL select query. Query result will be converted to Avro format. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer, or cron expression, using the standard scheduling methods, or it can be triggered by an incoming FlowFile. If it is triggered by an incoming FlowFile, then attributes of that FlowFile will be available when evaluating the select query. FlowFile attribute ‘executesql.row.count’ indicates how many rows were selected.

ExecuteStreamCommand 1.2.0

Executes an external command on the contents of a flow file, and creates a new flow file with the results of the command.

ExtractAvroMetadata 1.2.0

Extracts metadata from the header of an Avro datafile.

ExtractCCDAAttributes 1.2.0

Extracts information from an Consolidated CDA formatted FlowFile and provides individual attributes as FlowFile attributes. The attributes are named as . If the Parent is repeating, the naming will be . For example, section.act_07.observation.name=Essential hypertension

ExtractEmailAttachments 1.2.0

Extract attachments from a mime formatted email file, splitting them into individual flowfiles.

ExtractEmailHeaders 1.2.0

Using the flowfile content as source of data, extract header from an RFC compliant email file adding the relevant attributes to the flowfile. This processor does not perform extensive RFC validation but still requires a bare minimum compliance with RFC 2822

ExtractGrok 1.2.0

Evaluates one or more Grok Expressions against the content of a FlowFile, adding the results as attributes or replacing the content of the FlowFile with a JSON notation of the matched content

ExtractHL7Attributes 1.2.0

Extracts information from an HL7 (Health Level 7) formatted FlowFile and adds the information as FlowFile Attributes. The attributes are named as . If the segment is repeating, the naming will be . For example, we may have an attribute named "MHS.12" with a value of "2.1" and an attribute named "OBX_11.3" with a value of "93000^CPT4".

ExtractImageMetadata 1.2.0

Extract the image metadata from flowfiles containing images. This processor relies on this metadata extractor library https://github.com/drewnoakes/metadata-extractor. It extracts a long list of metadata types including but not limited to EXIF, IPTC, XMP and Photoshop fields. For the full list visit the library’s website.NOTE: The library being used loads the images into memory so extremely large images may cause problems.

ExtractMediaMetadata 1.2.0

Extract the content metadata from flowfiles containing audio, video, image, and other file types. This processor relies on the Apache Tika project for file format detection and parsing. It extracts a long list of metadata types for media files including audio, video, and print media formats.NOTE: the attribute names and content extracted may vary across upgrades because parsing is performed by the external Tika tools which in turn depend on other projects for metadata extraction. For the more details and the list of supported file types, visit the library’s website at http://tika.apache.org/.

ExtractText 1.2.0

Evaluates one or more Regular Expressions against the content of a FlowFile. The results of those Regular Expressions are assigned to FlowFile Attributes. Regular Expressions are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed. The first capture group, if any found, will be placed into that attribute name.But all capture groups, including the matching string sequence itself will also be provided at that attribute name with an index value provided, with the exception of a capturing group that is optional and does not match - for example, given the attribute name “regex” and expression “abc(def)?(g)” we would add an attribute “regex.1” with a value of “def” if the “def” matched. If the “def” did not match, no attribute named “regex.1” would be added but an attribute named “regex.2” with a value of “g” will be added regardless.The value of the property must be a valid Regular Expressions with one or more capturing groups. If the Regular Expression matches more than once, only the first match will be used unless the property enabling repeating capture group is set to true. If any provided Regular Expression matches, the FlowFile(s) will be routed to ‘matched’. If no provided Regular Expression matches, the FlowFile will be routed to ‘unmatched’ and no attributes will be applied to the FlowFile.

ExtractTNEFAttachments 1.2.0

Extract attachments from a mime formatted email file, splitting them into individual flowfiles.

FetchAzureBlobStorage 1.2.0

Retrieves contents of an Azure Storage Blob, writing the contents to the content of the FlowFile

FetchDistributedMapCache 1.2.0

Computes a cache key from FlowFile attributes, for each incoming FlowFile, and fetches the value from the Distributed Map Cache associated with that key. The incoming FlowFile’s content is replaced with the binary data received by the Distributed Map Cache. If there is no value stored under that key then the flow file will be routed to ‘not-found’. Note that the processor will always attempt to read the entire cached value into memory before placing it in it’s destination. This could be potentially problematic if the cached value is very large.

FetchElasticsearch 1.2.0

Retrieves a document from Elasticsearch using the specified connection properties and the identifier of the document to retrieve. If the cluster has been configured for authorization and/or secure transport (SSL/TLS) and the Shield plugin is available, secure connections can be made. This processor supports Elasticsearch 2.x clusters.

FetchElasticsearch5 1.2.0

Retrieves a document from Elasticsearch using the specified connection properties and the identifier of the document to retrieve. If the cluster has been configured for authorization and/or secure transport (SSL/TLS), and the X-Pack plugin is available, secure connections can be made. This processor supports Elasticsearch 5.x clusters.

FetchElasticsearchHttp 1.2.0

Retrieves a document from Elasticsearch using the specified connection properties and the identifier of the document to retrieve. Note that the full body of the document will be read into memory before being written to a Flow File for transfer.

FetchFile 1.2.0

Reads the contents of a file from disk and streams it into the contents of an incoming FlowFile. Once this is done, the file is optionally moved elsewhere or deleted to help keep the file system organized.

FetchFTP 1.2.0

Fetches the content of a file from a remote SFTP server and overwrites the contents of an incoming FlowFile with the content of the remote file.

FetchGCSObject 1.2.0

Fetches a file from a Google Cloud Bucket. Designed to be used in tandem with ListGCSBucket.

FetchHBaseRow 1.2.0

Fetches a row from an HBase table. The Destination property controls whether the cells are added as flow file attributes, or the row is written to the flow file content as JSON. This processor may be used to fetch a fixed row on a interval by specifying the table and row id directly in the processor, or it may be used to dynamically fetch rows by referencing the table and row id from incoming flow files.

FetchHDFS 1.2.0

Retrieves a file from HDFS. The content of the incoming FlowFile is replaced by the content of the file in HDFS. The file in HDFS is left intact without any changes being made to it.

FetchParquet 1.2.0

Reads from a given Parquet file and writes records to the content of the flow file using the selected record writer. The original Parquet file will remain unchanged, and the content of the flow file will be replaced with records of the selected type. This processor can be used with ListHDFS or ListFile to obtain a listing of files to fetch.

FetchS3Object 1.2.0

Retrieves the contents of an S3 Object and writes it to the content of a FlowFile

FetchSFTP 1.2.0

Fetches the content of a file from a remote SFTP server and overwrites the contents of an incoming FlowFile with the content of the remote file.

FuzzyHashContent 1.2.0

Calculates a fuzzy/locality-sensitive hash value for the Content of a FlowFile and puts that hash value on the FlowFile as an attribute whose name is determined by the property.Note: this processor only offers non-cryptographic hash algorithms. And it should be not be seen as a replacement to the HashContent processor.Note: The underlying library loads the entirety of the streamed content into and performs result evaluations in memory. Accordingly, it is important to consider the anticipated profile of content being evaluated by this processor and the hardware supporting it especially when working against large files.

GenerateFlowFile 1.2.0

This processor creates FlowFiles with random data or custom content. GenerateFlowFile is usefulfor load testing, configuration, and simulation.

GenerateTableFetch 1.2.0

Generates SQL select queries that fetch “pages” of rows from a table. The partition size property, along with the table’s row count, determine the size and number of pages and generated FlowFiles. In addition, incremental fetching can be achieved by setting Maximum-Value Columns, which causes the processor to track the columns’ maximum values, thus only fetching rows whose columns’ values exceed the observed maximums. This processor is intended to be run on the Primary Node only.

This processor can accept incoming connections; the behavior of the processor is different whether incoming connections are provided:

  • If no incoming connection(s) are specified, the processor will generate SQL queries on the specified processor schedule. Expression Language is supported for many fields, but no flow file attributes are available. However the properties will be evaluated using the Variable Registry.
  • If incoming connection(s) are specified and no flow file is available to a processor task, no work will be performed.
  • If incoming connection(s) are specified and a flow file is available to a processor task, the flow file’s attributes may be used in Expression Language for such fields as Table Name and others. However, the Max-Value Columns and Columns to Return fields must be empty or refer to columns that are available in each specified table.

GeoEnrichIP 1.2.0

Looks up geolocation information for an IP address and adds the geo information to FlowFile attributes. The geo data is provided as a MaxMind database. The attribute that contains the IP address to lookup is provided by the ‘IP Address Attribute’ property. If the name of the attribute provided is ‘X’, then the the attributes added by enrichment will take the form X.geo.

GetAzureEventHub 1.2.0

Receives messages from a Microsoft Azure Event Hub, writing the contents of the Azure message to the content of the FlowFile

GetCouchbaseKey 1.2.0

Get a document from Couchbase Server via Key/Value access. The ID of the document to fetch may be supplied by setting the property. NOTE: if the Document Id property is not set, the contents of the FlowFile will be read to determine the Document Id, which means that the contents of the entire FlowFile will be buffered in memory.

GetDynamoDB 1.2.0

Retrieves a document from DynamoDB based on hash and range key. The key can be string or number.For any get request all the primary keys are required (hash or hash and range based on the table keys).A Json Document (‘Map’) attribute of the DynamoDB item is read into the content of the FlowFile.

GetFile 1.2.0

Creates FlowFiles from files in a directory. NiFi will ignore files it doesn’t have at least read permissions for.

GetFTP 1.2.0

Fetches files from an FTP Server and creates FlowFiles from them

GetHBase 1.2.0

This Processor polls HBase for any records in the specified table. The processor keeps track of the timestamp of the cells that it receives, so that as new records are pushed to HBase, they will automatically be pulled. Each record is output in JSON format, as {“row”: “", "cells": { "<column 1 family>:<column 1 qualifier>": "<cell 1 value>", "<column 2 family>:<column 2 qualifier>": "<cell 2 value>", ... }}. For each record received, a Provenance RECEIVE event is emitted with the format hbase://<table name>/, where is the UTF-8 encoded value of the row's key.

GetHDFS 1.2.0

Fetch files from Hadoop Distributed File System (HDFS) into FlowFiles. This Processor will delete the file from HDFS after fetching it.

GetHDFSEvents 1.2.0

This processor polls the notification events provided by the HdfsAdmin API. Since this uses the HdfsAdmin APIs it is required to run as an HDFS super user. Currently there are six types of events (append, close, create, metadata, rename, and unlink). Please see org.apache.hadoop.hdfs.inotify.Event documentation for full explanations of each event. This processor will poll for new events based on a defined duration. For each event received a new flow file will be created with the expected attributes and the event itself serialized to JSON and written to the flow file’s content. For example, if event.type is APPEND then the content of the flow file will contain a JSON file containing the information about the append event. If successful the flow files are sent to the ‘success’ relationship. Be careful of where the generated flow files are stored. If the flow files are stored in one of processor’s watch directories there will be a never ending flow of events. It is also important to be aware that this processor must consume all events. The filtering must happen within the processor. This is because the HDFS admin’s event notifications API does not have filtering.

GetHDFSSequenceFile 1.2.0

Fetch sequence files from Hadoop Distributed File System (HDFS) into FlowFiles

GetHTMLElement 1.2.0

Extracts HTML element values from the incoming flowfile’s content using a CSS selector. The incoming HTML is first converted into a HTML Document Object Model so that HTML elements may be selected in the similar manner that CSS selectors are used to apply styles to HTML. The resulting HTML DOM is then “queried” using the user defined CSS selector string. The result of “querying” the HTML DOM may produce 0-N results. If no results are found the flowfile will be transferred to the “element not found” relationship to indicate so to the end user. If N results are found a new flowfile will be created and emitted for each result. The query result will either be placed in the content of the new flowfile or as an attribute of the new flowfile. By default the result is written to an attribute. This can be controlled by the “Destination” property. Resulting query values may also have data prepended or appended to them by setting the value of property “Prepend Element Value” or “Append Element Value”. Prepended and appended values are treated as string values and concatenated to the result retrieved from the HTML DOM query operation. A more thorough reference for the CSS selector syntax can be found at “http://jsoup.org/apidocs/org/jsoup/select/Selector.html”

GetHTTP 1.2.0

Fetches data from an HTTP or HTTPS URL and writes the data to the content of a FlowFile. Once the content has been fetched, the ETag and Last Modified dates are remembered (if the web server supports these concepts). This allows the Processor to fetch new data only if the remote data has changed or until the state is cleared. That is, once the content has been fetched from the given URL, it will not be fetched again until the content on the remote server changes. Note that due to limitations on state management, stored “last modified” and etag fields never expire. If the URL in GetHttp uses Expression Language that is unbounded, there is the potential for Out of Memory Errors to occur.

GetIgniteCache 1.2.0

Get the byte array from Ignite Cache and adds it as the content of a FlowFile.The processor uses the value of FlowFile attribute (Ignite cache entry key) as the cache key lookup. If the entry corresponding to the key is not found in the cache an error message is associated with the FlowFile Note - The Ignite Kernel periodically outputs node performance statistics to the logs. This message can be turned off by setting the log level for logger ‘org.apache.ignite’ to WARN in the logback.xml configuration file.

GetJMSQueue 1.2.0

Pulls messages from a JMS Queue, creating a FlowFile for each JMS Message or bundle of messages, as configured

GetJMSTopic 1.2.0

Pulls messages from a JMS Topic, creating a FlowFile for each JMS Message or bundle of messages, as configured

GetKafka 1.2.0

Fetches messages from Apache Kafka, specifically for 0.8.x versions. The complementary NiFi processor for sending messages is PutKafka.

GetMongo 1.2.0

Creates FlowFiles from documents in MongoDB

GetSFTP 1.2.0

Fetches files from an SFTP Server and creates FlowFiles from them

GetSNMP 1.2.0

Retrieves information from SNMP Agent and outputs a FlowFile with information in attributes and without any content

GetSolr 1.2.0

Queries Solr and outputs the results as a FlowFile

GetSplunk 1.2.0

Retrieves data from Splunk Enterprise.

GetSQS 1.2.0

Fetches messages from an Amazon Simple Queuing Service Queue

GetTCP 1.2.0

Connects over TCP to the provided endpoint(s). Received data will be written as content to the FlowFile

GetTwitter 1.2.0

Pulls status changes from Twitter’s streaming API

HandleHttpRequest 1.2.0

Starts an HTTP Server and listens for HTTP Requests. For each request, creates a FlowFile and transfers to ‘success’. This Processor is designed to be used in conjunction with the HandleHttpResponse Processor in order to create a Web Service

HandleHttpResponse 1.2.0

Sends an HTTP Response to the Requestor that generated a FlowFile. This Processor is designed to be used in conjunction with the HandleHttpRequest in order to create a web service.

HashAttribute 1.2.0

Hashes together the key/value pairs of several FlowFile Attributes and adds the hash as a new attribute. Optional properties are to be added such that the name of the property is the name of a FlowFile Attribute to consider and the value of the property is a regular expression that, if matched by the attribute value, will cause that attribute to be used as part of the hash. If the regular expression contains a capturing group, only the value of the capturing group will be used.

HashContent 1.2.0

Calculates a hash value for the Content of a FlowFile and puts that hash value on the FlowFile as an attribute whose name is determined by the property

IdentifyMimeType 1.2.0

Attempts to identify the MIME Type used for a FlowFile. If the MIME Type can be identified, an attribute with the name ‘mime.type’ is added with the value being the MIME Type. If the MIME Type cannot be determined, the value will be set to ‘application/octet-stream’. In addition, the attribute mime.extension will be set if a common file extension for the MIME Type is known.

InferAvroSchema 1.2.0

Examines the contents of the incoming FlowFile to infer an Avro schema. The processor will use the Kite SDK to make an attempt to automatically generate an Avro schema from the incoming content. When inferring the schema from JSON data the key names will be used in the resulting Avro schema definition. When inferring from CSV data a “header definition” must be present either as the first line of the incoming data or the “header definition” must be explicitly set in the property “CSV Header Definition”. A “header definition” is simply a single comma separated line defining the names of each column. The “header definition” is required in order to determine the names that should be given to each field in the resulting Avro definition. When inferring data types the higher order data type is always used if there is ambiguity. For example when examining numerical values the type may be set to “long” instead of “integer” since a long can safely hold the value of any “integer”. Only CSV and JSON content is currently supported for automatically inferring an Avro schema. The type of content present in the incoming FlowFile is set by using the property “Input Content Type”. The property can either be explicitly set to CSV, JSON, or “use mime.type value” which will examine the value of the mime.type attribute on the incoming FlowFile to determine the type of content present.

InvokeHTTP 1.2.0

An HTTP client processor which can interact with a configurable HTTP Endpoint. The destination URL and HTTP Method are configurable. FlowFile attributes are converted to HTTP headers and the FlowFile contents are included as the body of the request (if the HTTP Method is PUT, POST or PATCH).

InvokeScriptedProcessor 1.2.0

Experimental - Invokes a script engine for a Processor defined in the given script. The script must define a valid class that implements the Processor interface, and it must set a variable ‘processor’ to an instance of the class. Processor methods such as onTrigger() will be delegated to the scripted Processor instance. Also any Relationships or PropertyDescriptors defined by the scripted processor will be added to the configuration dialog. Experimental: Impact of sustained usage not yet verified.

ISPEnrichIP 1.2.0

Looks up ISP information for an IP address and adds the information to FlowFile attributes. The ISP data is provided as a MaxMind ISP database (Note that this is NOT the same as the GeoLite database utilizedby some geo enrichment tools). The attribute that contains the IP address to lookup is provided by the ‘IP Address Attribute’ property. If the name of the attribute provided is ‘X’, then the the attributes added by enrichment will take the form X.isp.

JoltTransformJSON 1.2.0

Applies a list of Jolt specifications to the flowfile JSON payload. A new FlowFile is created with transformed content and is routed to the ‘success’ relationship. If the JSON transform fails, the original FlowFile is routed to the ‘failure’ relationship.

ListAzureBlobStorage 1.2.0

Lists blobs in an Azure Storage container. Listing details are attached to an empty FlowFile for use with FetchAzureBlobStorage. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

ListDatabaseTables 1.2.0

Generates a set of flow files, each containing attributes corresponding to metadata about a table from a database connection. Once metadata about a table has been fetched, it will not be fetched again until the Refresh Interval (if set) has elapsed, or until state has been manually cleared.

ListenBeats 1.2.0

Listens for messages sent by libbeat compatible clients (e.g. filebeats, metricbeats, etc) using Libbeat’s ‘output.logstash’, writing its JSON formatted payload to the content of a FlowFile.This processor replaces the now deprecated ListenLumberjack

ListenHTTP 1.2.0

Starts an HTTP Server and listens on a given base path to transform incoming requests into FlowFiles. The default URI of the Service will be http://{hostname}:{port}/contentListener. Only HEAD and POST requests are supported. GET, PUT, and DELETE will result in an error and the HTTP response status code 405.

ListenLumberjack 1.2.0

This processor is deprecated and may be removed in the near future. Listens for Lumberjack messages being sent to a given port over TCP. Each message will be acknowledged after successfully writing the message to a FlowFile. Each FlowFile will contain data portion of one or more Lumberjack frames. In the case where the Lumberjack frames contain syslog messages, the output of this processor can be sent to a ParseSyslog processor for further processing.

ListenRELP 1.2.0

Listens for RELP messages being sent to a given port over TCP. Each message will be acknowledged after successfully writing the message to a FlowFile. Each FlowFile will contain data portion of one or more RELP frames. In the case where the RELP frames contain syslog messages, the output of this processor can be sent to a ParseSyslog processor for further processing.

ListenSMTP 1.2.0

This processor implements a lightweight SMTP server to an arbitrary port, allowing nifi to listen for incoming email. Note this server does not perform any email validation. If direct exposure to the internet is sought, it may be a better idea to use the combination of NiFi and an industrial scale MTA (e.g. Postfix). Threading for this processor is managed by the underlying smtp server used so the processor need not support more than one thread.

ListenSyslog 1.2.0

Listens for Syslog messages being sent to a given port over TCP or UDP. Incoming messages are checked against regular expressions for RFC5424 and RFC3164 formatted messages. The format of each message is: ()(VERSION )(TIMESTAMP) (HOSTNAME) (BODY) where version is optional. The timestamp can be an RFC5424 timestamp with a format of "yyyy-MM-dd'T'HH:mm:ss.SZ" or "yyyy-MM-dd'T'HH:mm:ss.S+hh:mm", or it can be an RFC3164 timestamp with a format of "MMM d HH:mm:ss". If an incoming messages matches one of these patterns, the message will be parsed and the individual pieces will be placed in FlowFile attributes, with the original message in the content of the FlowFile. If an incoming message does not match one of these patterns it will not be parsed and the syslog.valid attribute will be set to false with the original message in the content of the FlowFile. Valid messages will be transferred on the success relationship, and invalid messages will be transferred on the invalid relationship.

ListenTCP 1.2.0

Listens for incoming TCP connections and reads data from each connection using a line separator as the message demarcator. The default behavior is for each message to produce a single FlowFile, however this can be controlled by increasing the Batch Size to a larger value for higher throughput. The Receive Buffer Size must be set as large as the largest messages expected to be received, meaning if every 100kb there is a line separator, then the Receive Buffer Size must be greater than 100kb.

ListenUDP 1.2.0

Listens for Datagram Packets on a given port. The default behavior produces a FlowFile per datagram, however for higher throughput the Max Batch Size property may be increased to specify the number of datagrams to batch together in a single FlowFile. This processor can be restricted to listening for datagrams from a specific remote host and port by specifying the Sending Host and Sending Host Port properties, otherwise it will listen for datagrams from all hosts and ports.

ListenWebSocket 1.2.0

Acts as a WebSocket server endpoint to accept client connections. FlowFiles are transferred to downstream relationships according to received message types as the WebSocket server configured with this processor receives client requests

ListFile 1.2.0

Retrieves a listing of files from the local filesystem. For each file that is listed, creates a FlowFile that represents the file so that it can be fetched in conjunction with FetchFile. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data. Unlike GetFile, this Processor does not delete any data from the local filesystem.

ListFTP 1.2.0

Performs a listing of the files residing on an FTP server. For each file that is found on the remote server, a new FlowFile will be created with the filename attribute set to the name of the file on the remote server. This can then be used in conjunction with FetchFTP in order to fetch those files.

ListGCSBucket 1.2.0

Retrieves a listing of objects from an GCS bucket. For each object that is listed, creates a FlowFile that represents the object so that it can be fetched in conjunction with FetchGCSObject. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

ListHDFS 1.2.0

Retrieves a listing of files from HDFS. For each file that is listed in HDFS, creates a FlowFile that represents the HDFS file so that it can be fetched in conjunction with FetchHDFS. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data. Unlike GetHDFS, this Processor does not delete any data from HDFS.

ListS3 1.2.0

Retrieves a listing of objects from an S3 bucket. For each object that is listed, creates a FlowFile that represents the object so that it can be fetched in conjunction with FetchS3Object. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

ListSFTP 1.2.0

Performs a listing of the files residing on an SFTP server. For each file that is found on the remote server, a new FlowFile will be created with the filename attribute set to the name of the file on the remote server. This can then be used in conjunction with FetchSFTP in order to fetch those files.

LogAttribute 1.2.0

No description provided.

MergeContent 1.2.0

Merges a Group of FlowFiles together based on a user-defined strategy and packages them into a single FlowFile. It is recommended that the Processor be configured with only a single incoming connection, as Group of FlowFiles will not be created from FlowFiles in different connections. This processor updates the mime.type attribute as appropriate.

ModifyBytes 1.2.0

Discard byte range at the start and end or all content of a binary file.

ModifyHTMLElement 1.2.0

Modifies the value of an existing HTML element. The desired element to be modified is located by using CSS selector syntax. The incoming HTML is first converted into a HTML Document Object Model so that HTML elements may be selected in the similar manner that CSS selectors are used to apply styles to HTML. The resulting HTML DOM is then “queried” using the user defined CSS selector string to find the element the user desires to modify. If the HTML element is found the element’s value is updated in the DOM using the value specified “Modified Value” property. All DOM elements that match the CSS selector will be updated. Once all of the DOM elements have been updated the DOM is rendered to HTML and the result replaces the flowfile content with the updated HTML. A more thorough reference for the CSS selector syntax can be found at “http://jsoup.org/apidocs/org/jsoup/select/Selector.html”

MonitorActivity 1.2.0

Monitors the flow for activity and sends out an indicator when the flow has not had any data for some specified amount of time and again when the flow’s activity is restored

Notify 1.2.0

Caches a release signal identifier in the distributed cache, optionally along with the FlowFile’s attributes. Any flow files held at a corresponding Wait processor will be released once this signal in the cache is discovered.

ParseCEF 1.2.0

Parses the contents of a CEF formatted message and adds attributes to the FlowFile for headers and extensions of the parts of the CEF message. Note: This Processor expects CEF messages WITHOUT the syslog headers (i.e. starting at “CEF:0”

ParseEvtx 1.2.0

Parses the contents of a Windows Event Log file (evtx) and writes the resulting XML to the FlowFile

ParseSyslog 1.2.0

Attempts to parses the contents of a Syslog message in accordance to RFC5424 and RFC3164 formats and adds attributes to the FlowFile for each of the parts of the Syslog message.Note: Be mindfull that RFC3164 is informational and a wide range of different implementations are present in the wild. If messages fail parsing, considering using RFC5424 or using a generic parsing processors such as ExtractGrok.

PostHTTP 1.2.0

Performs an HTTP Post with the content of the FlowFile

PublishAMQP 1.2.0

Creates a AMQP Message from the contents of a FlowFile and sends the message to an AMQP Exchange.In a typical AMQP exchange model, the message that is sent to the AMQP Exchange will be routed based on the ‘Routing Key’ to its final destination in the queue (the binding). If due to some misconfiguration the binding between the Exchange, Routing Key and Queue is not set up, the message will have no final destination and will return (i.e., the data will not make it to the queue). If that happens you will see a log in both app-log and bulletin stating to that effect. Fixing the binding (normally done by AMQP administrator) will resolve the issue.

PublishJMS 1.2.0

Creates a JMS Message from the contents of a FlowFile and sends it to a JMS Destination (queue or topic) as JMS BytesMessage. FlowFile attributes will be added as JMS headers and/or properties to the outgoing JMS message.

PublishKafka 1.2.0

Sends the contents of a FlowFile as a message to Apache Kafka using the Kafka 0.9.x Producer. The messages to send may be individual FlowFiles or may be delimited, using a user-specified delimiter, such as a new-line. Please note there are cases where the publisher can get into an indefinite stuck state. We are closely monitoring how this evolves in the Kafka community and will take advantage of those fixes as soon as we can. In the mean time it is possible to enter states where the only resolution will be to restart the JVM NiFi runs on. The complementary NiFi processor for fetching messages is ConsumeKafka.

PublishKafka_0_10 1.2.0

Sends the contents of a FlowFile as a message to Apache Kafka using the Kafka 0.10.x Producer API.The messages to send may be individual FlowFiles or may be delimited, using a user-specified delimiter, such as a new-line. Please note there are cases where the publisher can get into an indefinite stuck state. We are closely monitoring how this evolves in the Kafka community and will take advantage of those fixes as soon as we can. In the meantime it is possible to enter states where the only resolution will be to restart the JVM NiFi runs on. The complementary NiFi processor for fetching messages is ConsumeKafka_0_10.

PublishKafkaRecord_0_10 1.2.0

Sends the contents of a FlowFile as individual records to Apache Kafka using the Kafka 0.10.x Producer API. The contents of the FlowFile are expected to be record-oriented data that can be read by the configured Record Reader. Please note there are cases where the publisher can get into an indefinite stuck state. We are closely monitoring how this evolves in the Kafka community and will take advantage of those fixes as soon as we can. In the meantime it is possible to enter states where the only resolution will be to restart the JVM NiFi runs on. The complementary NiFi processor for fetching messages is ConsumeKafka_0_10_Record.

PublishMQTT 1.2.0

Publishes a message to an MQTT topic

PutAzureBlobStorage 1.2.0

Puts content into an Azure Storage Blob

PutAzureEventHub 1.2.0

Sends the contents of a FlowFile to a Windows Azure Event Hub. Note: the content of the FlowFile will be buffered into memory before being sent, so care should be taken to avoid sending FlowFiles to this Processor that exceed the amount of Java Heap Space available.

PutCassandraQL 1.2.0

Execute provided Cassandra Query Language (CQL) statement on a Cassandra 1.x, 2.x, or 3.0.x cluster. The content of an incoming FlowFile is expected to be the CQL command to execute. The CQL command may use the ? to escape parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention cql.args.N.type and cql.args.N.value, where N is a positive integer. The cql.args.N.type is expected to be a lowercase string indicating the Cassandra type.

PutCloudWatchMetric 1.2.0

Publishes metrics to Amazon CloudWatch

PutCouchbaseKey 1.2.0

Put a document to Couchbase Server via Key/Value access.

PutDatabaseRecord 1.2.0

The PutDatabaseRecord processor uses a specified RecordReader to input (possibly multiple) records from an incoming flow file. These records are translated to SQL statements and executed as a single batch. If any errors occur, the flow file is routed to failure or retry, and if the records are transmitted successfully, the incoming flow file is routed to success. The type of statement executed by the processor is specified via the Statement Type property, which accepts some hard-coded values such as INSERT, UPDATE, and DELETE, as well as ‘Use statement.type Attribute’, which causes the processor to get the statement type from a flow file attribute. IMPORTANT: If the Statement Type is UPDATE, then the incoming records must not alter the value(s) of the primary keys (or user-specified Update Keys). If such records are encountered, the UPDATE statement issued to the database may do nothing (if no existing records with the new primary key values are found), or could inadvertently corrupt the existing data (by changing records for which the new values of the primary keys exist).

PutDistributedMapCache 1.2.0

Gets the content of a FlowFile and puts it to a distributed map cache, using a cache key computed from FlowFile attributes. If the cache already contains the entry and the cache update strategy is ‘keep original’ the entry is not replaced.’

PutDynamoDB 1.2.0

Puts a document from DynamoDB based on hash and range key. The table can have either hash and range or hash key alone. Currently the keys supported are string and number and value can be json document. In case of hash and range keys both key are required for the operation. The FlowFile content must be JSON. FlowFile content is mapped to the specified Json Document attribute in the DynamoDB item.

PutElasticsearch 1.2.0

Writes the contents of a FlowFile to Elasticsearch, using the specified parameters such as the index to insert into and the type of the document. If the cluster has been configured for authorization and/or secure transport (SSL/TLS) and the Shield plugin is available, secure connections can be made. This processor supports Elasticsearch 2.x clusters.

PutElasticsearch5 1.2.0

Writes the contents of a FlowFile to Elasticsearch, using the specified parameters such as the index to insert into and the type of the document. If the cluster has been configured for authorization and/or secure transport (SSL/TLS), and the X-Pack plugin is available, secure connections can be made. This processor supports Elasticsearch 5.x clusters.

PutElasticsearchHttp 1.2.0

Writes the contents of a FlowFile to Elasticsearch, using the specified parameters such as the index to insert into and the type of the document.

PutEmail 1.2.0

Sends an e-mail to configured recipients for each incoming FlowFile

PutFile 1.2.0

Writes the contents of a FlowFile to the local file system

PutFTP 1.2.0

Sends FlowFiles to an FTP Server

PutGCSObject 1.2.0

Puts flow files to a Google Cloud Bucket.

PutHBaseCell 1.2.0

Adds the Contents of a FlowFile to HBase as the value of a single cell

PutHBaseJSON 1.2.0

Adds rows to HBase based on the contents of incoming JSON documents. Each FlowFile must contain a single UTF-8 encoded JSON document, and any FlowFiles where the root element is not a single document will be routed to failure. Each JSON field name and value will become a column qualifier and value of the HBase row. Any fields with a null value will be skipped, and fields with a complex value will be handled according to the Complex Field Strategy. The row id can be specified either directly on the processor through the Row Identifier property, or can be extracted from the JSON document by specifying the Row Identifier Field Name property. This processor will hold the contents of all FlowFiles for the given batch in memory at one time.

PutHDFS 1.2.0

Write FlowFile data to Hadoop Distributed File System (HDFS)

PutHiveQL 1.2.0

Executes a HiveQL DDL/DML command (UPDATE, INSERT, e.g.). The content of an incoming FlowFile is expected to be the HiveQL command to execute. The HiveQL command may use the ? to escape parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention hiveql.args.N.type and hiveql.args.N.value, where N is a positive integer. The hiveql.args.N.type is expected to be a number indicating the JDBC Type. The content of the FlowFile is expected to be in UTF-8 format.

PutHiveStreaming 1.2.0

This processor uses Hive Streaming to send flow file data to an Apache Hive table. The incoming flow file is expected to be in Avro format and the table must exist in Hive. Please see the Hive documentation for requirements on the Hive table (format, partitions, etc.). The partition values are extracted from the Avro record based on the names of the partition columns as specified in the processor.

PutHTMLElement 1.2.0

Places a new HTML element in the existing HTML DOM. The desired position for the new HTML element is specified by using CSS selector syntax. The incoming HTML is first converted into a HTML Document Object Model so that HTML DOM location may be located in a similar manner that CSS selectors are used to apply styles to HTML. The resulting HTML DOM is then “queried” using the user defined CSS selector string to find the position where the user desires to add the new HTML element. Once the new HTML element is added to the DOM it is rendered to HTML and the result replaces the flowfile content with the updated HTML. A more thorough reference for the CSS selector syntax can be found at “http://jsoup.org/apidocs/org/jsoup/select/Selector.html”

PutIgniteCache 1.2.0

Stream the contents of a FlowFile to Ignite Cache using DataStreamer. The processor uses the value of FlowFile attribute (Ignite cache entry key) as the cache key and the byte array of the FlowFile as the value of the cache entry value. Both the string key and a non-empty byte array value are required otherwise the FlowFile is transferred to the failure relation. Note - The Ignite Kernel periodically outputs node performance statistics to the logs. This message can be turned off by setting the log level for logger ‘org.apache.ignite’ to WARN in the logback.xml configuration file.

PutJMS 1.2.0

Creates a JMS Message from the contents of a FlowFile and sends the message to a JMS Server

PutKafka 1.2.0

Sends the contents of a FlowFile as a message to Apache Kafka, specifically for 0.8.x versions. The messages to send may be individual FlowFiles or may be delimited, using a user-specified delimiter, such as a new-line. The complementary NiFi processor for fetching messages is GetKafka.

PutKinesisFirehose 1.2.0

Sends the contents to a specified Amazon Kinesis Firehose. In order to send data to firehose, the firehose delivery stream name has to be specified.

PutKinesisStream 1.2.0

Sends the contents to a specified Amazon Kinesis. In order to send data to Kinesis, the stream name has to be specified.

PutLambda 1.2.0

Sends the contents to a specified Amazon Lamba Function. The AWS credentials used for authentication must have permissions execute the Lambda function (lambda:InvokeFunction).The FlowFile content must be JSON.

PutMongo 1.2.0

Writes the contents of a FlowFile to MongoDB

PutParquet 1.2.0

Reads records from an incoming FlowFile using the provided Record Reader, and writes those records to a Parquet file. The schema for the Parquet file must be provided in the processor properties. This processor will first write a temporary dot file and upon successfully writing every record to the dot file, it will rename the dot file to it’s final name. If the dot file cannot be renamed, the rename operation will be attempted up to 10 times, and if still not successful, the dot file will be deleted and the flow file will be routed to failure. If any error occurs while reading records from the input, or writing records to the output, the entire dot file will be removed and the flow file will be routed to failure or retry, depending on the error.

PutRiemann 1.2.0

Send events to Riemann (http://riemann.io) when FlowFiles pass through this processor. You can use events to notify Riemann that a FlowFile passed through, or you can attach a more meaningful metric, such as, the time a FlowFile took to get to this processor. All attributes attached to events support the NiFi Expression Language.

PutS3Object 1.2.0

Puts FlowFiles to an Amazon S3 Bucket The upload uses either the PutS3Object method or PutS3MultipartUpload methods. The PutS3Object method send the file in a single synchronous call, but it has a 5GB size limit. Larger files are sent using the multipart upload methods that initiate, transfer the parts, and complete an upload. This multipart process saves state after each step so that a large upload can be resumed with minimal loss if the processor or cluster is stopped and restarted. A multipart upload consists of three steps 1) initiate upload, 2) upload the parts, and 3) complete the upload. For multipart uploads, the processor saves state locally tracking the upload ID and parts uploaded, which must both be provided to complete the upload. The AWS libraries select an endpoint URL based on the AWS region, but this can be overridden with the ‘Endpoint Override URL’ property for use with other S3-compatible endpoints. The S3 API specifies that the maximum file size for a PutS3Object upload is 5GB. It also requires that parts in a multipart upload must be at least 5MB in size, except for the last part. These limits are establish the bounds for the Multipart Upload Threshold and Part Size properties.

PutSFTP 1.2.0

Sends FlowFiles to an SFTP Server

PutSlack 1.2.0

Sends a message to your team on slack.com

PutSNS 1.2.0

Sends the content of a FlowFile as a notification to the Amazon Simple Notification Service

PutSolrContentStream 1.2.0

Sends the contents of a FlowFile as a ContentStream to Solr

PutSplunk 1.2.0

Sends logs to Splunk Enterprise over TCP, TCP + TLS/SSL, or UDP. If a Message Delimiter is provided, then this processor will read messages from the incoming FlowFile based on the delimiter, and send each message to Splunk. If a Message Delimiter is not provided then the content of the FlowFile will be sent directly to Splunk as if it were a single message.

PutSQL 1.2.0

Executes a SQL UPDATE or INSERT command. The content of an incoming FlowFile is expected to be the SQL command to execute. The SQL command may use the ? to escape parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention sql.args.N.type and sql.args.N.value, where N is a positive integer. The sql.args.N.type is expected to be a number indicating the JDBC Type. The content of the FlowFile is expected to be in UTF-8 format.

PutSQS 1.2.0

Publishes a message to an Amazon Simple Queuing Service Queue

PutSyslog 1.2.0

Sends Syslog messages to a given host and port over TCP or UDP. Messages are constructed from the “Message ___” properties of the processor which can use expression language to generate messages from incoming FlowFiles. The properties are used to construct messages of the form: ()(VERSION )(TIMESTAMP) (HOSTNAME) (BODY) where version is optional. The constructed messages are checked against regular expressions for RFC5424 and RFC3164 formatted messages. The timestamp can be an RFC5424 timestamp with a format of "yyyy-MM-dd'T'HH:mm:ss.SZ" or "yyyy-MM-dd'T'HH:mm:ss.S+hh:mm", or it can be an RFC3164 timestamp with a format of "MMM d HH:mm:ss". If a message is constructed that does not form a valid Syslog message according to the above description, then it is routed to the invalid relationship. Valid messages are sent to the Syslog server and successes are routed to the success relationship, failures routed to the failure relationship.

PutTCP 1.2.0

The PutTCP processor receives a FlowFile and transmits the FlowFile content over a TCP connection to the configured TCP server. By default, the FlowFiles are transmitted over the same TCP connection (or pool of TCP connections if multiple input threads are configured). To assist the TCP server with determining message boundaries, an optional “Outgoing Message Delimiter” string can be configured which is appended to the end of each FlowFiles content when it is transmitted over the TCP connection. An optional “Connection Per FlowFile” parameter can be specified to change the behaviour so that each FlowFiles content is transmitted over a single TCP connection which is opened when the FlowFile is received and closed after the FlowFile has been sent. This option should only be used for low message volume scenarios, otherwise the platform may run out of TCP sockets.

PutUDP 1.2.0

The PutUDP processor receives a FlowFile and packages the FlowFile content into a single UDP datagram packet which is then transmitted to the configured UDP server. The user must ensure that the FlowFile content being fed to this processor is not larger than the maximum size for the underlying UDP transport. The maximum transport size will vary based on the platform setup but is generally just under 64KB. FlowFiles will be marked as failed if their content is larger than the maximum transport size.

PutWebSocket 1.2.0

Sends messages to a WebSocket remote endpoint using a WebSocket session that is established by either ListenWebSocket or ConnectWebSocket.

QueryCassandra 1.2.0

Execute provided Cassandra Query Language (CQL) select query on a Cassandra 1.x, 2.x, or 3.0.x cluster. Query result may be converted to Avro or JSON format. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer, or cron expression, using the standard scheduling methods, or it can be triggered by an incoming FlowFile. If it is triggered by an incoming FlowFile, then attributes of that FlowFile will be available when evaluating the select query. FlowFile attribute ‘executecql.row.count’ indicates how many rows were selected.

QueryDatabaseTable 1.2.0

Generates and executes a SQL select query to fetch all rows whose values in the specified Maximum Value column(s) are larger than the previously-seen maxima. Query result will be converted to Avro format. Expression Language is supported for several properties, but no incoming connections are permitted. The Variable Registry may be used to provide values for any property containing Expression Language. If it is desired to leverage flow file attributes to perform these queries, the GenerateTableFetch and/or ExecuteSQL processors can be used for this purpose. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer or cron expression, using the standard scheduling methods. This processor is intended to be run on the Primary Node only. FlowFile attribute ‘querydbtable.row.count’ indicates how many rows were selected.

QueryDNS 1.2.0

A powerful DNS query processor primary designed to enrich DataFlows with DNS based APIs (e.g. RBLs, ShadowServer’s ASN lookup) but that can be also used to perform regular DNS lookups.

QueryElasticsearchHttp 1.2.0

Queries Elasticsearch using the specified connection properties. Note that the full body of each page of documents will be read into memory before being written to Flow Files for transfer. Also note that the Elasticsearch max_result_window index setting is the upper bound on the number of records that can be retrieved using this query. To retrieve more records, use the ScrollElasticsearchHttp processor.

QueryRecord 1.2.0

Evaluates one or more SQL queries against the contents of a FlowFile. The result of the SQL query then becomes the content of the output FlowFile. This can be used, for example, for field-specific filtering, transformation, and row-level filtering. Columns can be renamed, simple calculations and aggregations performed, etc. The Processor is configured with a Record Reader Controller Service and a Record Writer service so as to allow flexibility in incoming and outgoing data formats. The Processor must be configured with at least one user-defined property. The name of the Property is the Relationship to route data to, and the value of the Property is a SQL SELECT statement that is used to specify how input data should be transformed/filtered. The SQL statement must be valid ANSI SQL and is powered by Apache Calcite. If the transformation fails, the original FlowFile is routed to the ‘failure’ relationship. Otherwise, the data selected will be routed to the associated relationship. See the Processor Usage documentation for more information.

QueryWhois 1.2.0

A powerful whois query processor primary designed to enrich DataFlows with whois based APIs (e.g. ShadowServer’s ASN lookup) but that can be also used to perform regular whois lookups.

ReplaceText 1.2.0

Updates the content of a FlowFile by evaluating a Regular Expression (regex) against it and replacing the section of the content that matches the Regular Expression with some alternate value.

ReplaceTextWithMapping 1.2.0

Updates the content of a FlowFile by evaluating a Regular Expression against it and replacing the section of the content that matches the Regular Expression with some alternate value provided in a mapping file.

ResizeImage 1.2.0

Resizes an image to user-specified dimensions. This Processor uses the image codecs registered with the environment that NiFi is running in. By default, this includes JPEG, PNG, BMP, WBMP, and GIF images.

RouteHL7 1.2.0

Routes incoming HL7 data according to user-defined queries. To add a query, add a new property to the processor. The name of the property will become a new relationship for the processor, and the value is an HL7 Query Language query. If a FlowFile matches the query, a copy of the FlowFile will be routed to the associated relationship.

RouteOnAttribute 1.2.0

Routes FlowFiles based on their Attributes using the Attribute Expression Language

RouteOnContent 1.2.0

Applies Regular Expressions to the content of a FlowFile and routes a copy of the FlowFile to each destination whose Regular Expression matches. Regular Expressions are added as User-Defined Properties where the name of the property is the name of the relationship and the value is a Regular Expression to match against the FlowFile content. User-Defined properties do support the Attribute Expression Language, but the results are interpreted as literal values, not Regular Expressions

RouteText 1.2.0

Routes textual data based on a set of user-defined rules. Each line in an incoming FlowFile is compared against the values specified by user-defined Properties. The mechanism by which the text is compared to these user-defined properties is defined by the ‘Matching Strategy’. The data is then routed according to these rules, routing each line of the text individually.

ScanAttribute 1.2.0

Scans the specified attributes of FlowFiles, checking to see if any of their values are present within the specified dictionary of terms

ScanContent 1.2.0

Scans the content of FlowFiles for terms that are found in a user-supplied dictionary. If a term is matched, the UTF-8 encoded version of the term will be added to the FlowFile using the ‘matching.term’ attribute

ScrollElasticsearchHttp 1.2.0

Scrolls through an Elasticsearch query using the specified connection properties. This processor is intended to be run on the primary node, and is designed for scrolling through huge result sets, as in the case of a reindex. The state must be cleared before another query can be run. Each page of results is returned, wrapped in a JSON object like so: { “hits” : [ , , ] }. Note that the full body of each page of documents will be read into memory before being written to a Flow File for transfer.

SegmentContent 1.2.0

Segments a FlowFile into multiple smaller segments on byte boundaries. Each segment is given the following attributes: fragment.identifier, fragment.index, fragment.count, segment.original.filename; these attributes can then be used by the MergeContent processor in order to reconstitute the original FlowFile

SelectHiveQL 1.2.0

Execute provided HiveQL SELECT query against a Hive database connection. Query result will be converted to Avro or CSV format. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer, or cron expression, using the standard scheduling methods, or it can be triggered by an incoming FlowFile. If it is triggered by an incoming FlowFile, then attributes of that FlowFile will be available when evaluating the select query. FlowFile attribute ‘selecthiveql.row.count’ indicates how many rows were selected.

SetSNMP 1.2.0

Based on incoming FlowFile attributes, the processor will execute SNMP Set requests. When founding attributes with name like snmp$, the processor will atempt to set the value of attribute to the corresponding OID given in the attribute name

SplitAvro 1.2.0

Splits a binary encoded Avro datafile into smaller files based on the configured Output Size. The Output Strategy determines if the smaller files will be Avro datafiles, or bare Avro records with metadata in the FlowFile attributes. The output will always be binary encoded.

SplitContent 1.2.0

Splits incoming FlowFiles by a specified byte sequence

SplitJson 1.2.0

Splits a JSON File into multiple, separate FlowFiles for an array element specified by a JsonPath expression. Each generated FlowFile is comprised of an element of the specified array and transferred to relationship ‘split,’ with the original file transferred to the ‘original’ relationship. If the specified JsonPath is not found or does not evaluate to an array element, the original file is routed to ‘failure’ and no files are generated.

SplitRecord 1.2.0

Splits up an input FlowFile that is in a record-oriented data format into multiple smaller FlowFiles

SplitText 1.2.0

Splits a text file into multiple smaller text files on line boundaries limited by maximum number of lines or total size of fragment. Each output split file will contain no more than the configured number of lines or bytes. If both Line Split Count and Maximum Fragment Size are specified, the split occurs at whichever limit is reached first. If the first line of a fragment exceeds the Maximum Fragment Size, that line will be output in a single split file which exceeds the configured maximum size limit. This component also allows one to specify that each split should include a header lines. Header lines can be computed by either specifying the amount of lines that should constitute a header or by using header marker to match against the read lines. If such match happens then the corresponding line will be treated as header. Keep in mind that upon the first failure of header marker match, no more matches will be performed and the rest of the data will be parsed as regular lines for a given split. If after computation of the header there are no more data, the resulting split will consists of only header lines.

SplitXml 1.2.0

Splits an XML File into multiple separate FlowFiles, each comprising a child or descendant of the original root element

SpringContextProcessor 1.2.0

A Processor that supports sending and receiving data from application defined in Spring Application Context via predefined in/out MessageChannels.

StoreInKiteDataset 1.2.0

Stores Avro records in a Kite dataset

TailFile 1.2.0

“Tails” a file, or a list of files, ingesting data from the file as it is written to the file. The file is expected to be textual. Data is ingested only when a new line is encountered (carriage return or new-line character or combination). If the file to tail is periodically “rolled over”, as is generally the case with log files, an optional Rolling Filename Pattern can be used to retrieve data from files that have rolled over, even if the rollover occurred while NiFi was not running (provided that the data still exists upon restart of NiFi). It is generally advisable to set the Run Schedule to a few seconds, rather than running with the default value of 0 secs, as this Processor will consume a lot of resources if scheduled very aggressively. At this time, this Processor does not support ingesting files that have been compressed when ‘rolled over’.

TransformXml 1.2.0

Applies the provided XSLT file to the flowfile XML payload. A new FlowFile is created with transformed content and is routed to the ‘success’ relationship. If the XSL transform fails, the original FlowFile is routed to the ‘failure’ relationship

UnpackContent 1.2.0

Unpacks the content of FlowFiles that have been packaged with one of several different Packaging Formats, emitting one to many FlowFiles for each input FlowFile

UpdateAttribute 1.2.0

Updates the Attributes for a FlowFile by using the Attribute Expression Language and/or deletes the attributes based on a regular expression

UpdateCounter 1.2.0

This processor allows users to set specific counters and key points in their flow. It is useful for debugging and basic counting functions.

ValidateCsv 1.2.0

Validates the contents of FlowFiles against a user-specified CSV schema. Take a look at the additional documentation of this processor for some schema examples.

ValidateXml 1.2.0

Validates the contents of FlowFiles against a user-specified XML Schema file

Wait 1.2.0

Routes incoming FlowFiles to the ‘wait’ relationship until a matching release signal is stored in the distributed cache from a corresponding Notify processor. When a matching release signal is identified, a waiting FlowFile is routed to the ‘success’ relationship, with attributes copied from the FlowFile that produced the release signal from the Notify processor. The release signal entry is then removed from the cache. Waiting FlowFiles will be routed to ‘expired’ if they exceed the Expiration Duration. If you need to wait for more than one signal, specify the desired number of signals via the ‘Target Signal Count’ property. This is particularly useful with processors that split a source FlowFile into multiple fragments, such as SplitText. In order to wait for all fragments to be processed, connect the ‘original’ relationship to a Wait processor, and the ‘splits’ relationship to a corresponding Notify processor. Configure the Notify and Wait processors to use the ‘${fragment.identifier}’ as the value of ‘Release Signal Identifier’, and specify ‘${fragment.count}’ as the value of ‘Target Signal Count’ in the Wait processor.

YandexTranslate 1.2.0

Translates content and attributes from one language to another

If you have questions about a processor, I’d encourage you to download the binaries and start up Apache Nifi. Now you can see the documentation for processors in their documentation, which is now built with each release. If you really want more information, let me know and I’ll try to compile a more complete post about each and every processor.