Dear Apache NiFi people: almost every technology featured on HN could be described as "an easy to use, powerful, and reliable system to process and distribute data." Please consider using a tagline that will tell people something about your project. E.g., intended audience, chosen problem space, desired benefit.
You'll note that a lot of the discussion here is, "What is it? Is it like X? Is it good for X?" Those are great questions to answer on the home page.
I find the first paragraph of the User Guide a bit better than the main page of the NiFi website:
Apache NiFi is a dataflow system based on the concepts of flow-based programming. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. NiFi has a web-based user interface for design, control, feedback, and monitoring of dataflows. It is highly configurable along several dimensions of quality of service, such as loss-tolerant versus guaranteed delivery, low latency versus high throughput, and priority-based queuing. NiFi provides fine-grained data provenance for all data received, forked, joined, cloned, modified, sent, and ultimately dropped upon reaching its configured end-state.
The point is that it doesn't say anything about what it does. A programming language could be described as an "easy to use, powerful, and reliable system to process and distribute data." It processes and distributes data? That just sounds like a computer.
Or as an open-source version of enterprise tools like EDIFECS XEngine -- I think that's more exact, but which reference is more useful probably depends on who you are marketing it to.
I think it's identical in concern, but substantially different in both manner and content. Aside from his obvious abusive behavior, I gave specific suggestions for ways to address the issue.
I was thinking, "wow, reading in data, processing it, and outputting data. Reminds me of Interface Engines in healthcare." Then I saw "HL7" on the diagram.
That's what I did for a long time. HL7 contains information such as "Admissions, Discharges, and Transfers (ADT)" which is sent from the Hospital Information System (HIS) to other departments' systems (radiology, pharmacy, medical records, and possibly dozens more) and vice versa (LAB results back to HIS, for example).
HL7 interface engines unpack the HL7 data into separate segments, fields, and subfields, identify the type/subtype of message, route to various destinations based on the type and any other fields, map the data and reformat as necessary for the destination, encode back into HL7, and send it. They also need to queue the messages as needed, re-transmit or set aside a message, or just stop sending, depending on what they get back in the form of response codes. So you can see some of those steps in the diagram. They also handle input/output, whether in the form of TCP/IP ports, reading/writing files (for batch processing), or pretty much any other method by which you can send and receive data (serial ports, in some old cases).
I'm kind of curious how (and how easy it is) to define which fields in the input for a given event go to the output (and do certain transformations on those fields if needed).
I could see using this as a layer in between your traditional interface engine and your own services. I used to work with Cloverleaf and I ended up building my own endpoint; Cloverleaf would route all the messages I was interested in to my endpoint and then I'd ship them off to the appropriate application from there. The most interesting thing I did was look for messages from the Radiology system about scans (MRI, CAT Scan, X-Ray, etc.) and then notify the primary care physician for the patient ("Hey, log in here to view the new imaging data for your patient!").
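To make the unpack / identify / route steps above a bit more concrete, here is a minimal sketch in Python. It is not how any particular interface engine does it: the sample message, the `ROUTES` table, and the destination names are all made up for illustration, and real engines also handle escape sequences, repetitions, acknowledgements, and queuing.

```python
# Hypothetical sample HL7 v2 message; segments are separated by carriage returns.
SAMPLE = "\r".join([
    "MSH|^~\\&|HIS|HOSPITAL|RAD|HOSPITAL|202001011200||ADT^A01|MSG0001|P|2.3",
    "PID|1||12345^^^HOSPITAL||DOE^JOHN",
    "PV1|1|I|WARD1^101^A",
])

# Hypothetical routing table: message type -> destination systems.
ROUTES = {
    "ADT": ["radiology", "pharmacy", "medical_records"],
    "ORU": ["his"],
}


def parse_segments(message):
    """Split a message into segments, and each segment into '|'-delimited fields."""
    return [line.split("|") for line in message.split("\r") if line]


def message_type(segments):
    """Return (type, event) taken from MSH-9, e.g. ('ADT', 'A01')."""
    msh = segments[0]
    if msh[0] != "MSH":
        raise ValueError("first segment must be MSH")
    # MSH-1 is the field separator itself, so after splitting on '|'
    # the MSH-9 field ends up at list index 8.
    parts = msh[8].split("^")
    event = parts[1] if len(parts) > 1 else ""
    return parts[0], event


def route(message):
    """Pick destinations for a message based on its type."""
    msg_type, _event = message_type(parse_segments(message))
    return ROUTES.get(msg_type, ["error_queue"])


print(message_type(parse_segments(SAMPLE)))
print(route(SAMPLE))
```

Mapping and re-encoding for each destination would hang off the same parsed structure; they are left out to keep the sketch short.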
Something like this would make that much, much easier. Cloverleaf would route to this Apache product, which would then route to the appropriate service. The data on traffic, etc., would be especially useful (i.e., the service that sends faxes is slow, etc.)
Nifi wraps each file with a "flowfile", which contains metadata about the flow. You can add to and change those variables as the flow goes on using the "UpdateAttribute" processor, and then route based on those attributes.
Each processor in the flow receives the full flow file, so each step along the way gets the "state" in full. This helps in case something interrupts the flow - you can resume an hour or a day later.
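As a rough mental model of the above (this is plain Python, not NiFi's API; the class and function names are hypothetical), a flowfile can be pictured as content plus a dictionary of attributes that each step may extend and that later steps can route on:

```python
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy stand-in for a flowfile: the payload plus metadata attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


def update_attribute(flowfile, **new_attrs):
    """Return a copy of the flowfile with attributes added or overwritten."""
    return FlowFile(flowfile.content, {**flowfile.attributes, **new_attrs})


def route_on_attribute(flowfile, key, routes, default="unmatched"):
    """Choose the next step by looking up one attribute's value."""
    return routes.get(flowfile.attributes.get(key), default)


ff = FlowFile(b"...payload...", {"filename": "scan_001.hl7"})
ff = update_attribute(ff, department="radiology")
print(route_on_attribute(ff, "department", {"radiology": "notify_physician"}))
```

Because every step receives the whole object, content and attributes together, the state needed to resume travels with the data rather than living in the processor.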
Hey what's wrong with Mirth? I've long since moved on but I was one of the original engineers building Mirth. I wrote a lot of the HL7 parsing logic and the tcp transports.
User interview protip: if you actually want to know what's wrong with your product, never tell people that it's your product.
At my last startup, we did user tests every week, and my cofounder was very careful never to let on which test items were made by us. We didn't have our name on the door or on the buzzer. Nothing with our logo was visible between the entryway and the user interview room. And when he started showing products, he'd always start with somebody else's.
It was great. We got some incredibly honest feedback (sometimes brutally so) on what we were building. It helped us kill a lot of bad directions early. Most people just don't want to tell you that your baby is ugly, but they'll happily dish if they think it's somebody else's.
Any tips for when the user definitely knows it's your product? I can't think of anything besides starting with butchering some aspect of it. Then again that might trigger some empathetic response, and end with even less actionable feedback.
I think people can't un-know something, so I'd try very hard to get test subjects in a way where they don't think the people they are talking to are the ones who make the product.
For example, you could set up a fake market research org, and have them say they are "conducting a study on [your market] and are looking for users of products like [competitor 1], [competitor 2], and [your product]". Then for the user testing sessions you could rent a conference room from somebody like Regus, or even rent a user testing lab. Then during the tests, make sure to start with general questions and test (or show screens from) the other products before you get to testing yours.
I'm not sure if those specifics will work in your environment, but I hope you see what I mean.
To me, it's the fact that Mirth nickel-and-dimes users for "premium" features and has pretty poor documentation to push people towards support plans. It's an open source platform "kind of", which rubs some folks the wrong way.
However, the functionality of the engine itself is top notch. We've had some Mirth servers running for a year at this point without incident. It handles a lot of things around the nuances of HL7 nonchalantly, which has usually been my hang-up with non-healthcare specific toolsets like Mule or Camel.
Last time I used Mirth (going on 3 years ago now), it was a pretty good engine wrapped up in terrible UI/UX. The core logic and some of the built-in transformation (particularly the HL7v2 parser, actually!) was good, but any time you had to interact with Mirth through the dashboard (say, for example, building a channel) it was incredibly painful -- especially if you had to chain multiple channels in a row. No way to see the flow of data, and lots of fumbling interactions with a clunky JWS interface.
If you haven't checked out Mirth in its v3 format, it's not terrible. I find that a lot of the hold-up with HL7 setup in regards to logic involves VPN setup and cloning/maintaining different channels that basically do identical things. We've built some tooling on top of Mirth to do this where I work and I think it's a better approach than black box integration solutions.
That being said, I'll definitely be checking this out.
When people participate in Apache projects, it is emphasized that they do so as individuals. Affiliations are downplayed. If a contributor takes a new job at a different company, their participation within the project is completely unaffected.
Furthermore, as a 501(c)(3), the ASF is limited in what it can do with donations and very, very rarely accepts targeted donations aimed at a specific project -- it's just not worth the hassle or risk. So while outside entities might contribute by sponsoring people to work on the project, no project at the ASF has someone "behind" it in the sense of direct funding.
This is all part of maintaining project independence.
Hope this helps to explain why it is not always easy to discover "who is behind" an Apache project. :)
Sure I understand that. And you're right, it is important for project independence.
But it is usual to be able to deduce at least something about contributors from email address, github accounts, other contact info. But in this case there was absolutely nothing. It was almost as if they were experts in covering their tracks! I just thought it was quite funny.
We were wowed by NiFi when we looked at it originally. Once we put it in a local env to build test flows, we found that the most complex tasks for data flows were fairly simple to set up. And the simplest tasks ended up requiring complex workarounds because the system was trying to be extra smart about what it was doing. In the end, we decided not to use it in production due to the 80/20 split of simple/complex tasks we had.
Hopefully it's better than what it was in the Jan/Feb timeframe.
- We wanted to collect files from various locations and push them into HDFS. NiFi seemed like a good way to build a self-service setup. Once we set up the source and sink, if we read, processed, and removed a file from the destination, NiFi copied it again. We did not have control over always removing source files as soon as they were copied to the destination. The components we used were GetFile and PutFile, plus some options in the Conflict Resolution settings for these components.
- Inspect file name, run a script to generate new subdirectories for partitions and place the file in appropriate partitions. Attaching a script was easy. Changing the destination path on the fly was not.
Some Complex cases - there are other ways to do this but setting this up in Nifi was a breeze
- Set up file collection from ftp, sftp, file copy from 20+ locations. This was painless, few minutes per source
- Add REST interaction within data flow easily
- Read CSV files and convert to Avro/Sequence files
- Read files and route part of data to different processors
We also ran into some strange bugs where Nifi got stuck in some type of loop and kept copying data over and over again.
We were able to do all this testing in 2 weeks. Give it a shot, it might work out for your use case.
Been using this for a few months at work. I was originally a skeptic, as my experience with "drag-and-drop" coding hasn't been positive, but I've come around after using it for a while.
The guys developing it are incredibly responsive. I submitted a bug one morning via the mail list at around 9 am, and by 11am they had a patch slated for their next release.
It's rare that I step into a conversation in software fields where I am completely at a loss for what is being discussed. This is one of those conversations. I have no idea what the description of the project means, and after reading the conversation here at HN, including people quoting references and such that explain what it does, I still don't really know what it does.
I'm not complaining about not understanding, I just had one of those moments of "wow, the world of software is really big and there are vast, heavily funded, corners of it that I've never even heard of".
The short version: any complex data processing done at scale will have a lot of steps and fail a lot. To avoid having to constantly monitor everything, you have systems called "workflow managers" that help you make sense of those multi-step monsters and manage their dependencies, their failures, etc.
If you never had to process that kind of data (we're talking about Google, Facebook and Twitter kind of data) then you probably have no exposure to this kind of system.
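For anyone who has never seen one, the core of a workflow manager can be sketched in a few lines of Python. This is only a toy under stated assumptions (the `run_workflow` function and its retry policy are made up here; real systems add scheduling, persistence, monitoring, and distribution): run steps in dependency order, retry the ones that fail, and skip anything downstream of a step that never succeeded.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, deps, max_retries=2):
    """Run tasks in dependency order with simple retries.

    tasks: dict of task name -> zero-argument callable
    deps:  dict of task name -> set of task names it depends on
    Returns a dict of task name -> "ok", "failed", or "skipped".
    """
    graph = {name: set(deps.get(name, ())) for name in tasks}
    status = {}
    for name in TopologicalSorter(graph).static_order():
        # Do not run a step whose prerequisites did not all succeed.
        if any(status.get(dep) != "ok" for dep in graph[name]):
            status[name] = "skipped"
            continue
        for _attempt in range(max_retries + 1):
            try:
                tasks[name]()
                status[name] = "ok"
                break
            except Exception:
                status[name] = "failed"
    return status
```

The returned status map is the part a human actually cares about: it says which steps ran, which gave up after retries, and which were never attempted.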
There's a really informative discussion about NiFi on the Flow-based Programming mailing list [1].
One thing that is discussed is how NiFi, in contrast to "proper FBP", has only one inport, to which all incoming connections connect, so incoming information packets are merged, and subsequently need to be "routed".
Graphical dataflow programming is super powerful. It's the bread and butter of Ab Initio Software, which powers the data infrastructure of many of the world's largest corporations. I'm glad to see an open-source project entering that market too.
I remember using LabVIEW around ~2000 to read/display data from a high altitude balloon coming in over packet radio. In that case it seemed at least somewhat useful. It provided a quick and dirty way to grab data from various sources and throw it into some pretty graphs.
Years later, I had to maintain a software application built by a bunch of idiots who thought it would be a great idea to use LabVIEW as their GUI layer (because apparently that was easier than learning/using a proper GUI toolkit). This monstrosity communicated with a back-end running on Solaris via LabVIEW's freaking horrendous TCP control. The whole thing made absolutely no sense. Though LabVIEW did provide a rather nifty visualization of spaghetti code.
I suspect this was a resume-padding exercise for the original authors. Rumor had it that one of the engineers responsible for the decision to use LabVIEW had been hired away by National Instruments. And we all cursed him.
So anyway, after that experience I'm pretty well prejudiced against graphical 'programming'. I took one look at this NiFi thing and said 'Ha! Nope.'
Dataflow programming[1] (which NiFi is created to do) is NOT the same as your typical extract/transform/load (ETL) tool with a nice user interface!
Wikipedia says Dataflow programming languages share some features of functional languages, and were generally developed in order to bring some functional concepts to a language more suitable for numeric processing.
The thing closest to Dataflow that is most commonly used is the concept of DAG operations in Spark, but Dataflow usually makes time windowing a first-level concept. Spark Streaming is moving towards this type of thing.
It is true that there is overlap with ETL tools, but that undersells what Dataflow is.
So ETL is mostly about getting data into some kind of processing system. That means there are lots of functions for dealing with things like csv data and polling directories and transforming json data into flat structures and... etc.
Dataflow is a programming model for performing actions on data. A system that implements dataflow programming will probably have functions to load external data into the structure needed for the system, but it isn't primarily about moving data to another system.
For example, Google Dataflow[1] has functions for reading files etc, but there aren't really the huge number of things for cleaning and processing data that a real ETL system has. Instead, you load the data into the system, and then process it for a specific task.
"Apache Nifi is a new incubator project and was originally developed at the NSA. In short, it is a data flow management system similar to Apache Camel and Flume. It's mostly intended for getting data from a source to a sink. It can do light weight processing such as enrichment and conversion, but not heavy duty ETL. One unique feature of Nifi is its built-in UI, which makes the management and the monitoring of the data flow convenient. The whole data flow pipeline can be drawn on a panel. The UI shows statistics such as in/out byte rates, failures and latency in each of the edges in the flow. One can pause and resume the flow in real time in the UI. Nifi's Architecture is also a bit different from Camel and Flume. There is a master node and many slave nodes. The slaves are running the actual data flow and the master is for monitoring the slaves. Each slave has a web server, a flow controller (thread pool) layer, and a storage layer. All events are persisted to a local content repository. It also stores the lineage information in a separate governance repository, which allows it to trace at the event level. Currently, the fault tolerance story in Nifi is a bit weak. The master is a single point of failure. There is also no redundancy across the slaves. So, if a slave dies, the flow stops until the slave is brought up again." from http://www.confluent.io/blog/apachecon-2015/
Kind of like that, only a lot more flexible in what it can do since it isn't a public service. You can run arbitrary DB Queries, and it supports more sources of data like FTP and files dropped on the local file system.
I doubt it borrowed from Y! Pipes in particular. There is a category of server software called the "integration engine" that has been around for a while. The purpose of it is to integrate multiple third party vendor systems, so you aren't locked in to one vendor for everything. You see it a lot in industries that use software but aren't about software like health care or finance.
That seems to be pretty much what my friend built at one company as a customer-proprietary system. Basically it looks exactly the same. Those boxes are just code modules / microservices with custom code. It's also important that everything can be configured, modified and routed in real time by adding new boxes etc. I really loved that design. OK, technically the same results can be reached using multiple different architectures, but this suits the microservices concept very well. Yet it could lead to high latency, depending on multiple different aspects and on how the modules are technically connected.
In general I'm curious how people handle managing diffs in the workflow over time. I've found when working with Microsoft SSIS, that I end up preferring something in code where changes are obvious.
I'm not sure what the preferred approach is, but since it seems to have a REST API with full query and update ability on the flow graph, it seems like you should be able to do a workflow where you make an update on a dev server (updates are immediately live), run a process that captures that via the REST API, serializes it, and pushes the serialization to version control, and have another process that takes a serialized version and pushes it to other servers (prod, etc.) via the REST API.
Why visual programming tools don't (at least in documentation) address version control is an interesting question, given how important that is in programming generally, and the fact that those tools are often focused on enterprise audiences.
I worked on a product once that provided users an interactive graph. The users could apply layout algorithms AND modify the graph by moving nodes and edges around. From time to time, nodes and edges would be added and removed by the system.
So there's a distinction there between the graph and the layout of the graph, i.e. the x,y coordinates of nodes and the routing of edges. What would version control track? This is a non-trivial problem imo. Edit: specifically, storing a graph in some canonical form so that trivial changes would not create massive 'false positives' in the diff. For example, a changed edge showing up as if an entire subgraph had been added or removed. The problem is similar to XML tree diffing.
This is why I think versioning in these tools is pretty much non existent, imo of course.
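One way to picture the "canonical form" idea, as a minimal Python sketch: serialize only the logical graph, in a sorted and stable order, and keep the layout (x,y positions) out of the versioned text. The node/edge dictionary shape and the `canonical_graph` name are assumptions for illustration, not the format of any particular tool, and this does not solve the harder subgraph-matching problem.

```python
import json


def canonical_graph(nodes, edges):
    """Serialize the logical graph deterministically, ignoring layout.

    nodes: iterable of dicts with at least "id" and "type" (extra keys
           such as "x" and "y" are treated as layout and dropped)
    edges: iterable of dicts with "source" and "target"
    """
    doc = {
        "nodes": sorted(
            ({"id": n["id"], "type": n["type"]} for n in nodes),
            key=lambda n: n["id"],
        ),
        "edges": sorted([e["source"], e["target"]] for e in edges),
    }
    # One edge or node per line keeps a text diff small and local.
    return json.dumps(doc, indent=2, sort_keys=True)


# The same flow after a user dragged a node: only x,y and ordering differ.
before = canonical_graph(
    [{"id": "get", "type": "GetFile", "x": 10, "y": 20},
     {"id": "put", "type": "PutFile", "x": 300, "y": 20}],
    [{"source": "get", "target": "put"}],
)
after = canonical_graph(
    [{"id": "put", "type": "PutFile", "x": 450, "y": 90},
     {"id": "get", "type": "GetFile", "x": 10, "y": 20}],
    [{"source": "get", "target": "put"}],
)
print(before == after)
```

Layout could be stored in a separate file, so that moving boxes around never shows up as a change to the flow logic.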
Not sure about how NiFi handles this (<- an interesting question). But, I believe this is one of the strengths of AirFlow's "workflow as code" model. Thus, the workflow can use git for version control.
I'm interested. This looks like the proper evolution of a cohesive Camel+ActiveMQ+Felix system. Classloader isolation is so important, and it's where most of the big applications servers failed.
Big Data ELT (Extract Load Transform) similar to Informatica (bleh), Talend, Kettle, etc... but designed for big/fast data warehousing, data mining systems.
I haven't tried this, but this looks f'n awesome for what we do. We've been using LinkedIn's Azkaban or Confluent as two separate paths waiting for cloudera's dataflow.
After a few minutes fiddling it seems you make "processors" which can do things like file input/output and data manipulation, chain them by dragging a connector; outputs can be things like SFTP. Closest thing I've used to this would be YahooPipes but this is self-hosted and so can access local files and such - seems powerful but I [as a non-coder] wonder if it wouldn't just be quicker to knock up a bit of python script for the sorts of things I imagine doing with this; that's probably a lot to do with limitations in my knowledge (of both).
Some good demos/examples one can mess with would be great.
Could someone explain what one might use this for? I have read both the website and a bit of the documentation, but cannot think of a use case. Obviously there is one, but I am very ignorant as to what it might be.
At my last job, we worked with a lot of financial transaction input files (e.g. bank transactions, share transactions etc) which we called data feeds. They came from a lot of different sources (SFTP, FTP, WebDAV, HTTPS, SSH) and as a result we ended up with a bunch of different scripts, many of which we would occasionally forget about. This would appear to be perfect for that situation: gather the data; extract the files; check the data; put it somewhere; and optionally modify it.
Could be very useful if you can script jobs in code, and the gui just writes to code as well, so you can use version control and diffs to see what actually changed.
At least luigi, and I think also airflow, are batch workflow systems. This is more of a streaming system as far as I know, and also allows for more control over data routing, as in airflow and luigi mainly define dependencies between whole tasks, while in NiFi you can route each output of every task separately.
I have been playing with both, and I prefer Spring XD's DSL. It looks a lot like Unix pipes, which I could easily grok (and it has the full suite of Spring monitoring and logging tools built in). That being said, they are both excellent projects.