dmitrim's comments

For now it only tries to extract the NCCL time percentage from the profile, if available, and show it in the profile summary. Some hints could be found in the step trace timeline as well. We are also planning to record some NCCL-related counters separately.


The problem with NCCL is that it reports combined bandwidth: NVLink (intra-node) and network. I want to see the network traffic on its own, for example to identify a network link bottleneck when changing the model or pipeline parallelism configuration.

p.s. if you solve this I’ll become a paying customer.


Understood, we'll definitely think about the network part. Just in case it helps: if `nvidia-smi nvlink -gt d` is useful for you in this context, there is a related metric, NVLink Throughput Rate, for comparing runs and monitoring. At least you might get an idea of whether/how the internal links are utilized.


Yes, I thought about that - in theory I can measure the total traffic with mpirun, then subtract the NVLink traffic (as measured by nvidia-smi) from it. However, I'm not 100% sure that the NVLink traffic from nvidia-smi is the same as the NVLink component of the mpirun total. I'd prefer to measure the inter-node traffic directly (e.g. using Mellanox tools) as a more reliable method.
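
To illustrate with made-up numbers: if the total measured traffic comes out to 100 GB/s and the nvidia-smi NVLink counters account for 80 GB/s of it, the network share would be roughly 20 GB/s - but that subtraction only holds if both counters cover the same traffic over the same time window, which is exactly the part I'm not sure about.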


Yes, exactly this.


Thanks for the feedback, we'll be working on it for sure. At least an explanatory screencast is in the works now; other info material, use cases, etc. are planned.


I'd just use 2.8, but older 2.x versions should work too. If you encounter any issues, please let us know via the chat in your account.


Any support for 1.4?


To be more precise, >=2.2 is required for profiler support.


Many training and inference workloads run in the cloud or on remote servers, and profiling them is not straightforward. Having a SaaS makes things much simpler and also enables additional features, such as team access and sharing. As far as data privacy is concerned, profiles do not contain any model or raw data, just resource usage, execution statistics, etc., which is acceptable for most users to send to a third party. As for the business model, in my opinion, SaaS allows us to better monetize the offering and ensure a better, up-to-date end product. But this is open; we may consider a free client-side version as well at some point.


Since ML monitoring is a rather broad term that can apply to the model development, evaluation, retraining and production stages, I'd like to give more context on what Graphsignal is designed for. Our focus is the operational aspect of models deployed to production, e.g. incoming data validity, sudden drift in input and/or output data, etc., making it possible to troubleshoot issues when they are detected. So it is designed to help MLOps, DevOps and SRE teams ensure production models' performance and availability.


Good point, thanks! The idea behind these benchmarks is to make the results usable in real-world programs, rather than benchmarking real-world programs. I rephrased that sentence to avoid any confusion.


Thanks for pointing it out. It clearly shouldn't depend on the number of iterations. It's fixed now.


I think there's another bug in the generateSlice function if the intention is to create a slice with n random numbers.

    func generateSlice(n int) []int {
        s := make([]int, n)
        for i := 0; i < n; i++ {
            s = append(s, rand.Intn(1e9))
        }
        return s
    }
As it is now, the function creates a slice with n zeros followed by n random numbers. I suppose you meant to say make([]int, 0, n). You could just as well assign directly to each slice element instead of using append, which would be more efficient.
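
For example, the direct-assignment version could look like this (untested sketch, assuming the same math/rand import as the original):

    func generateSlice(n int) []int {
        s := make([]int, n) // length n, elements start out as zeros
        for i := range s {
            s[i] = rand.Intn(1e9) // overwrite each element in place, no append
        }
        return s
    }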

I made the exact same mistake quite a few times myself.


Yep, that was meant to be capacity, not length. Corrected. Thanks!


We haven't tested it with Celery yet, but it looks like it should work. gevent is supported by the blocking call profiler, and the CPU and memory profilers, as well as exception and metric reporting, are library independent.


We haven't tested the whole agent with asyncio applications yet; I believe only the CPU profiler was tested during development. We'll do that and include it in the docs. For now, if you see any problems, please just open a ticket. Thanks!



We are measuring both: the individual profiler overhead when active (printed by the agent in debug mode), and the total CPU and memory overhead of the app running over long periods of time with and without the agent.


Are these apps under load? Is there really only a 1% difference when running apache-bench or siege on the applications?


Yes, the apps were under simulated CPU load, memory allocations, etc. The good thing about sampling profilers is that the overhead stays relatively stable even under high load, since the cost is driven by the sampling rate rather than the request rate.

