dmitrim's comments

For now it only tries to extract the NCCL time percentage from the profile, if available, and show it in the profile summary. Some hints could be found in the step trace timeline as well. We are also planning to record some NCCL-related counters separately.


The problem with NCCL is that it reports combined bandwidth: NVLink (intra-node) and network. I want to see the network traffic on its own, for example to identify a network link bottleneck when changing the model or pipeline parallelism configuration.

p.s. if you solve this I’ll become a paying customer.


Understood, we'll definitely think about the network part. Just in case it helps: if `nvidia-smi nvlink -gt d` is useful for you in this context, there is a related metric, NVLink Throughput Rate, for comparing runs and monitoring. At least you might get an idea of whether/how the internal links are utilized.


Yes, I thought about that - in theory I can measure the total traffic with mpirun, then subtract the NVLink traffic (as measured by nvidia-smi) from it. However, I'm not 100% sure that the NVLink traffic from nvidia-smi is the same as the NVLink component of the mpirun total. I'd prefer to measure the inter-node traffic directly (e.g. using Mellanox tools) as a more reliable method.
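
To illustrate with made-up numbers: if the total measured traffic comes out to 100 GB/s and the nvidia-smi NVLink counters account for 80 GB/s of it, the network share would be roughly 20 GB/s - but that subtraction only holds if both counters cover the same traffic over the same time window, which is exactly the part I'm not sure about.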


Yes, exactly this.


Thanks for the feedback, we'll be working on it for sure. At least an explanatory screencast is in the works now; other info material, use cases, etc. are planned.


I'd just use 2.8, but older 2.x versions should work too. If you encounter any issues, please let us know via the chat in your account.


Any support for 1.4?


To be more precise, >=2.2 is required for profiler support.


Many training and inference workloads run in the cloud or on remote servers, and profiling them is not straightforward. Having a SaaS makes things much simpler and also enables additional features, such as team access and sharing. As far as data privacy is concerned, profiles do not contain any model or raw data, just resource usage, execution statistics, etc., which is acceptable for most users to send to a third party. As for the business model, in my opinion, SaaS allows us to better monetize the offering and ensure a better, up-to-date end product. But this is open; we may consider a free client-side version as well at some point.


Since ML monitoring is a rather broad term that can apply to the model development, evaluation, retraining and production stages, I'd like to give more context on what Graphsignal is designed for. Our focus is the operational aspect of models deployed to production, e.g. incoming data validity, sudden drift in input and/or output data, etc., making it possible to troubleshoot issues when they are detected. So it is designed to help MLOps, DevOps and SRE teams ensure production models' performance and availability.


Good point, thanks! The idea behind these benchmarks is to make the results usable in real-world programs, rather than benchmarking real-world programs. I rephrased that sentence to avoid any confusion.


Thanks for pointing it out. It clearly shouldn't depend on the number of iterations. It's fixed now.


I think there's another bug in the generateSlice function if the intention is to create a slice with n random numbers.

    func generateSlice(n int) []int {
        s := make([]int, n)
        for i := 0; i < n; i++ {
            s = append(s, rand.Intn(1e9))
        }
        return s
    }
As it is now, the function creates a slice with n zeros followed by n random numbers. I suppose you meant to say make([]int, 0, n). You could just as well assign directly to each slice element instead of using append, which would be more efficient.
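
For example, the direct-assignment version could look like this (untested sketch, assuming the same math/rand import as the original):

    func generateSlice(n int) []int {
        s := make([]int, n) // length n, elements start out as zeros
        for i := range s {
            s[i] = rand.Intn(1e9) // overwrite each element in place, no append
        }
        return s
    }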

I made the exact same mistake quite a few times myself.


Yep, that was meant to be capacity, not length. Corrected. Thanks!


We haven't tested it with Celery yet, but it looks like it should work. gevent is supported by the blocking call profiler, and the CPU and memory profilers, as well as exception and metric reporting, are library independent.


We haven't tested the whole agent with asyncio applications yet; I believe only the CPU profiler was tested during development. We'll do that and include it in the docs. For now, if you see any problems, please just open a ticket. Thanks!



We are measuring both: the individual profiler overhead when active (printed by the agent in debug mode), and the total CPU and memory overhead of the app running over long periods of time with and without the agent.


Are these apps under load? Is there really only a 1% difference when running apache-bench or siege on the applications?


Yes, the apps were under simulated CPU load, memory allocations, etc. The good thing about sampling profilers is that the overhead stays relatively stable even under high load, since the cost is driven by the sampling rate rather than the request rate.

